

# **Low-Power Neural Network Accelerators with Custom Floating-Point Computation**

*Dissertation zur Erlangung des akademischen Grades  
Doktor-Ingenieur (Dr.-Ing.) im Fach Elektrotechnik  
und Informationstechnik*

**YARIB NEVAREZ**

1. Gutachter: Prof. Dr. Alberto García-Ortiz  
2. Gutachter: Prof. Dr. X

Eingereicht am: 27.05.2022  
Tag des Promotionskolloquiums: DD.MM.2022

This work is licensed under a Creative Commons “Attribution-NonCommercial-ShareAlike 3.0 Unported” license.



# Acknowledgment

This work is funded by the *Consejo Nacional de Ciencia y Tecnología – CONACYT* (the Mexican National Council for Science and Technology).

I would like to thank Prof. Dr. Alberto García-Ortiz, my Ph.D. supervisor, for his invaluable help in this process. He encouraged me when I underestimated my work and he criticized me when I overestimated it. He knows how to raise students to the level of independent researchers. I would also like to thank Prof. Dr. X for his guidance and for his time to review this work. Moreover, I would like to thank the members of the graduation committee for their review and suggestions. I would like to thank Mexico and Conacyt for financially supporting my PhD. Special thanks to Tovalin, Julian Rosales, Fernando De la Torre, Ulises Ponce, Carlos Cruz, and Kai Müller for their inspiration for teaching.

I would also like to thank my colleagues Ardalan Najafi, Amir Najafi, Wanli Yu, Yanqiu Huang, Robert Schmidt, Yizhi Chen, Jinming Sun, and Andreas Beering. They are excellent human beans, professional, supportive, and brilliant minds. I also want to thank David Rotermund and Klaus Pawelzik (Institute for Theoretical Physics, University of Bremen) for their collaboration and guidance. A special thanks to all the students I have supervised during my Ph.D. for helping me understand life better. I would like to thank Kerstin Janssen and Peter Lutzen for their support in the research department. I would also like to thank the University of Bremen and the Studierendenwerk for providing the organization and facilities. I am sure that it would not be possible to reach this point without your kind support.

I would like to thank Atena Berhang, my life-partner, and her family for their sweet and constant support. Likewise, I want to thank my parents for their effort and encouragement to practice virtue, universal love, fraternity, brotherhood, and morality since my childhood. I want to thank my brothers Kevin and Efren, great sages, my best friends. I also thank other family and friends for their support and good wishes during my doctorate.

I would like to thank Marcus Aurelius, Epictetus, Seneca, and Nezahualcóyotl for their meditations, thoughts, poems, and mentorship.

Yarib Nevarez

Bremen, May 2022.



# Abstract

The use of Artificial Intelligence (AI) is entering a new era based on the use of ubiquitous embedded connected devices. The sustainability of this transformation requires the adoption of design techniques that reconcile accurate results with cost-effective system architectures. As such, improving the efficiency of AI hardware engines as well as machine learning (ML) portability must be considered.

In the emerging era of Industry 4.0, ML algorithms yield the power of AI to massively ubiquitous Internet-of-Things (IoT) devices. Applications in this field become smarter and more profitable as the availability of big data gets expanded, driving evolution of many aspects in science, industry, and daily life. However, state-of-the-art ML algorithms, specially spiking neural network (SNN) and convolutional neural network (CNN), represent elevated computational and energy costs. Therefore, hardware efficiency is one of the major goals to innovate compute engines as they are the machinery of the future.

Energy, performance, and chip-area are the key design concerns in computer systems. Considering the intrinsic error resilience of ML algorithms, paradigms such as approximate computing come to the rescue by offering promising efficiency gains to assist the aforementioned design concerns. Approximation techniques are widely used in ML algorithms at the model-structure as well as at the hardware processing level. However, state-of-the-art methods do not sufficiently address accelerator designs for artificial neural network (ANN), in particular with floating-point (FP) computation.

To sustain the continuous expansion of ML applications on cost-effective compute devices, approximate computing will gradually transform from a design alternative to an essential prerequisite. This dissertation focuses on the investigation of design methodologies to exploit the intrinsic error resilience of ML algorithms to optimize FP inference in low-power embedded systems.

In the field of SNN, this dissertation presents a hardware design methodology for low-power inference of Spike-by-Spike (SbS) neural networks targeting embedded applications. This ML algorithm provides exceptional noise robustness and reduced complexity compared to conventional SNN with leaky integrate and fire (LIF) mechanism. However, SbS networks represent a

---

memory footprint and a computational cost unsuitable for embedded applications. To address this problem, this work exploits the intrinsic error resilience of SbS to improve performance and to reduce hardware complexity. More precisely, we design a vector dot-product module based on approximate computing with configurable quality using hybrid custom FP and logarithmic number representations. This approach reduces computational run-time, memory footprint, and power dissipation while preserving inference accuracy. To demonstrate this approach, we address a design exploration flow with high-level synthesis (HLS) on a field-programmable gate array (FPGA). The proposed design reduces  $20.5\times$  run-time and  $8\times$  weight memory footprint, with less than 0.5% of accuracy degradation without retraining on a handwritten digit classification task.

In the field of CNN, this dissertation presents a hardware design methodology for low-power inference targeting sensor analytics applications. In this work, we present the Hybrid-Float6 (HF6) quantization and its dedicated hardware processor. We propose an optimized FP Multiply-Accumulate Unit (MAC) hardware by reducing the mantissa multiplication to a multiplexer-adder operation. We exploit the intrinsic error tolerance of neural networks to further reduce the hardware design with approximation on the subnormal number computation. To preserve model accuracy, we present a quantization-aware training (QAT) method, which in some cases improves accuracy. We demonstrate this concept in 2D convolution layers. We present a lightweight tensor processor (TP) implementing a pipelined vector dot-product. For ML portability, the custom FP representation is wrapped in the standard format, which is automatically extracted by the proposed hardware. The hardware/software architecture is integrated with TensorFlow (TF) Lite. We evaluate the applicability of our approach with a CNN-regression model for anomaly localization in a structural health monitoring (SHM) application based on acoustic emission (AE). The embedded hardware/software framework is demonstrated on XC7Z007S as the smallest Zynq-7000 system-on-chip (SoC). The proposed implementation achieves a peak power efficiency and acceleration of 5.7 GFLOPS/s/W and  $48.3\times$ , respectively.

The outcome of this dissertation aims to contribute to the rise of a sustainable next generation of low-power FP neural network processors with ML portability as a design philosophy.

# Contents

|                                                                 |           |
|-----------------------------------------------------------------|-----------|
| <b>1. Introduction</b>                                          | <b>1</b>  |
| 1.1. Preamble . . . . .                                         | 2         |
| 1.1.1. Industry 4.0 . . . . .                                   | 2         |
| 1.1.2. Internet-of-Things in Industry . . . . .                 | 2         |
| 1.1.3. Artificial Intelligence in Internet-of-Things . . . . .  | 2         |
| 1.1.4. Error Tolerance in Machine Learning Algorithms . . . . . | 2         |
| 1.2. Problem Statement . . . . .                                | 3         |
| 1.3. Research Objective . . . . .                               | 4         |
| 1.4. Working Hypothesis . . . . .                               | 4         |
| 1.4.1. Network Compression and Quantization . . . . .           | 4         |
| 1.4.2. Approximate Computing . . . . .                          | 4         |
| 1.5. Motivation . . . . .                                       | 5         |
| 1.5.1. Spike-by-Spike Neural Networks . . . . .                 | 5         |
| 1.5.2. Convolutional Neural Networks . . . . .                  | 6         |
| 1.6. Main Contribution . . . . .                                | 6         |
| 1.6.1. Spike-by-Spike Neural Networks . . . . .                 | 6         |
| 1.6.2. Convolutional Neural Networks . . . . .                  | 7         |
| 1.7. Publications . . . . .                                     | 8         |
| 1.8. Dissertation Outline . . . . .                             | 9         |
| <b>2. Background</b>                                            | <b>11</b> |
| 2.1. Spike-by-Spike Neural Networks . . . . .                   | 11        |
| 2.1.1. Basic Network Overview . . . . .                         | 12        |
| 2.1.2. Computational Cost . . . . .                             | 12        |
| 2.1.3. Error Tolerance . . . . .                                | 14        |
| 2.2. Conv2D Tensor Operation . . . . .                          | 15        |
| 2.3. Floating-point Number Representation . . . . .             | 15        |

|                                                                                                          |           |
|----------------------------------------------------------------------------------------------------------|-----------|
| <b>3. Accelerating Spike-by-Spike Neural Networks</b>                                                    | <b>19</b> |
| 3.1. Introduction . . . . .                                                                              | 19        |
| 3.2. Related Work . . . . .                                                                              | 23        |
| 3.2.1. Network Compression . . . . .                                                                     | 23        |
| 3.2.2. Classical Approximate Computing . . . . .                                                         | 24        |
| 3.2.3. Spike-by-Spike Neural Networks Accelerators . . . . .                                             | 25        |
| 3.3. System Design . . . . .                                                                             | 25        |
| 3.3.1. Hardware Architecture . . . . .                                                                   | 25        |
| 3.3.2. Conv Processing Unit . . . . .                                                                    | 27        |
| Configuration Mode . . . . .                                                                             | 27        |
| Computation Mode . . . . .                                                                               | 27        |
| 3.3.3. Dot-Product Hardware Module . . . . .                                                             | 29        |
| Dot-Product with Standard Floating-Point Computation . . . . .                                           | 30        |
| Dot-Product with Hybrid Custom Floating-Point and Logarithmic Ap-<br>proximation . . . . .               | 30        |
| 3.4. Experimental Results . . . . .                                                                      | 33        |
| 3.4.1. Performance Benchmark . . . . .                                                                   | 34        |
| Benchmark on Embedded CPU . . . . .                                                                      | 34        |
| Benchmark on Processing Units with Standard Floating-Point Compu-<br>tation . . . . .                    | 35        |
| Benchmark on Noise Tolerance Plot . . . . .                                                              | 39        |
| 3.4.2. Design Exploration with Hybrid Custom Floating-Point and Logarith-<br>mic Approximation . . . . . | 39        |
| Parameters for Numeric Representation of Synaptic Weight Matrix . . . . .                                | 39        |
| Design Exploration for Dot-product with Hybrid Custom Floating-Point<br>Approximation . . . . .          | 40        |
| Design Exploration for Dot-Product whit Hybrid Logarithmic Approx-<br>imation . . . . .                  | 42        |
| 3.4.3. Results and Discussion . . . . .                                                                  | 44        |
| 3.5. Conclusions . . . . .                                                                               | 46        |
| <b>4. Accelerating Convolutional Neural Networks</b>                                                     | <b>49</b> |
| 4.1. Introduction . . . . .                                                                              | 49        |
| 4.2. Related Work . . . . .                                                                              | 52        |
| 4.2.1. Hybrid Custom Floating-Point . . . . .                                                            | 52        |
| 4.2.2. Low-Precision Floating-Point . . . . .                                                            | 53        |

|                                                                                                           |                                             |           |
|-----------------------------------------------------------------------------------------------------------|---------------------------------------------|-----------|
| 4.2.3.                                                                                                    | Low-Power Hardware Architectures . . . . .  | 53        |
| 4.3.                                                                                                      | System Design . . . . .                     | 53        |
| 4.3.1.                                                                                                    | Base Embedded System Architecture . . . . . | 53        |
| 4.3.2.                                                                                                    | Tensor Processor . . . . .                  | 54        |
| Modes of Operation . . . . .                                                                              | 54                                          |           |
| Dot-Product with Hybrid Floating-Point Computation . . . . .                                              | 55                                          |           |
| Multiply-Accumulate . . . . .                                                                             | 55                                          |           |
| On-Chip Memory Utilization . . . . .                                                                      | 58                                          |           |
| 4.3.3.                                                                                                    | Training Method . . . . .                   | 59        |
| Training with Iterative Early Stop . . . . .                                                              | 59                                          |           |
| Quantization Aware Training . . . . .                                                                     | 60                                          |           |
| 4.3.4.                                                                                                    | Embedded software architecture . . . . .    | 61        |
| 4.4.                                                                                                      | Experimental Results . . . . .              | 63        |
| 4.4.1.                                                                                                    | Sensor Analytics Application . . . . .      | 63        |
| Experimental Setup . . . . .                                                                              | 63                                          |           |
| Data Sets . . . . .                                                                                       | 63                                          |           |
| CNN-Regression Model . . . . .                                                                            | 66                                          |           |
| 4.4.2.                                                                                                    | Training . . . . .                          | 66        |
| Base Model . . . . .                                                                                      | 66                                          |           |
| TensorFlow Lite 8-bit Quantization . . . . .                                                              | 67                                          |           |
| Quantization Aware Training for Hybrid-Float6 . . . . .                                                   | 68                                          |           |
| Quantization Aware Training for Hybrid-Logarithmic 6-bit . . . . .                                        | 70                                          |           |
| 4.4.3.                                                                                                    | Hardware Design Exploration . . . . .       | 70        |
| Benchmark on Embedded CPU . . . . .                                                                       | 70                                          |           |
| Benchmark on Tensor Processor with Standard Floating-Point Hardware<br>using Xilinx LogiCORE IP . . . . . | 71                                          |           |
| Tensor Processor with Hybrid-Float6 Hardware . . . . .                                                    | 72                                          |           |
| 4.4.4.                                                                                                    | Discussion . . . . .                        | 73        |
| Training and Quantization . . . . .                                                                       | 73                                          |           |
| Implementation and Performance . . . . .                                                                  | 75                                          |           |
| SoC Design and Compatibility . . . . .                                                                    | 76                                          |           |
| Future Work . . . . .                                                                                     | 76                                          |           |
| 4.5.                                                                                                      | Conclusions . . . . .                       | 77        |
| <b>5.</b>                                                                                                 | <b>Conclusion and Outlook</b>               | <b>79</b> |
| 5.1.                                                                                                      | Summary of Contributions . . . . .          | 80        |

*Contents*

---

|                              |           |
|------------------------------|-----------|
| 5.2. Future Works . . . . .  | 80        |
| <b>A. Appendix</b>           | <b>81</b> |
| A.1. SbS algorithm . . . . . | 81        |



---

# 1. Introduction

---

|                                                                 |          |
|-----------------------------------------------------------------|----------|
| <b>1.1. Preamble . . . . .</b>                                  | <b>2</b> |
| 1.1.1. Industry 4.0 . . . . .                                   | 2        |
| 1.1.2. Internet-of-Things in Industry . . . . .                 | 2        |
| 1.1.3. Artificial Intelligence in Internet-of-Things . . . . .  | 2        |
| 1.1.4. Error Tolerance in Machine Learning Algorithms . . . . . | 2        |
| <b>1.2. Problem Statement . . . . .</b>                         | <b>3</b> |
| <b>1.3. Research Objective . . . . .</b>                        | <b>4</b> |
| <b>1.4. Working Hypothesis . . . . .</b>                        | <b>4</b> |
| 1.4.1. Network Compression and Quantization . . . . .           | 4        |
| 1.4.2. Approximate Computing . . . . .                          | 4        |
| <b>1.5. Motivation . . . . .</b>                                | <b>5</b> |
| 1.5.1. Spike-by-Spike Neural Networks . . . . .                 | 5        |
| 1.5.2. Convolutional Neural Networks . . . . .                  | 6        |
| <b>1.6. Main Contribution . . . . .</b>                         | <b>6</b> |
| 1.6.1. Spike-by-Spike Neural Networks . . . . .                 | 6        |
| 1.6.2. Convolutional Neural Networks . . . . .                  | 7        |
| <b>1.7. Publications . . . . .</b>                              | <b>8</b> |
| <b>1.8. Dissertation Outline . . . . .</b>                      | <b>9</b> |

---

## *1. Introduction*

---

### **1.1. Preamble**

#### **1.1.1. Industry 4.0**

Industry is the piece of an economy that produces material goods which are highly mechanized and automatized. Since the beginning of industrialization, technological leaps have led to paradigm shifts that are now called "industrial revolutions": from mechanization, electrification, and later, digitalization (the so-called 3rd industrial revolution). Based on the advanced digitalization within factories, the combination of Internet technologies and future-oriented technologies in the field of "smart" things (machines and products) seems to result in a new fundamental paradigm shift in industrial production. Emerging from this future expectation, the term "Industry 4.0" was established for an expected "4th industrial revolution" [1].

#### **1.1.2. Internet-of-Things in Industry**

To build the emerging environment of Industry 4.0, disruptive technologies are required to handle autonomous communications between all industrial embedded computers throughout the factory and the Internet. Such technologies offer the potential to transform the industry along the entire production chain and stimulate productivity and overall economic growth [2]. These technologies include cloud computing, big data, and specially a new generation of IoT devices fused with cyber-physical systems (CPS), safety-security, augmented reality, hardware accelerators, and AI in general [3].

#### **1.1.3. Artificial Intelligence in Internet-of-Things**

The continuous evolution of AI algorithms and IoT devices has not only made AI the major workload running on these embedded devices, but AI has become the main approach for industrial solutions, especially in the rise of Industry 4.0 [3]. In fact, there is a clear motivation to run AI/ML algorithms on IoT devices because of [4]: (1) feasibility of mission-critical real-time processing/inference; (2) privacy and security of data; (3) offline operation capability; and (4) robustness for stressed communication. Hence, the traditional term of IoT has also been redefined as AI of Things (AIoT) to emphasize the impact of AI/ML on this technology [5].

#### **1.1.4. Error Tolerance in Machine Learning Algorithms**

An algorithm can be regarded as error-tolerant or error-resilient when it provides a result with the required accuracy while utilizing processing components with a certain degree of inaccuracy.

There are several reasons why an algorithm/application is tolerant of errors as discussed in [6]. These include noisy or redundant data of the algorithm, approximate or probabilistic computations within the algorithm, and a range of acceptable outcomes. This is the case of ML algorithms for AI applications.

## 1.2. Problem Statement

The problem lies in the fact that state-of-the-art AI/ML algorithms, particularly ANNs, are highly compute and data intensive. This represents significant computational challenges across the spectrum of computing hardware, specially in the scope of embedded systems [7]. One of the most deployed applications is computer vision using CNNs. Compared to the conventional image processing methods, the CNN accuracy has improved significantly that by 2015, a human can no longer beat a computer in image classification [4]. The early development of CNNs before 2016 mainly focused on accuracy improvement without considering computational costs. While accuracy of deep CNN for image classification improved 24% between 2012 and 2016, the demand on hardware resources increased more than 10 $\times$ . Starting from 2017, significant attention was paid to improve hardware efficiency in terms of compute power, memory bandwidth, and power consumption, while maintaining accuracy at a similar level to human perception [7].

Consequently, the recent breakthroughs in AI/ML applications have brought significant advancements in neural network processors [14]. These rapid evolution, however, came at the cost of an important demand for computational power. Hence, to bring the inference speed to an acceptable level, custom application-specific integrated circuit (ASIC) with neural processing unit (NPU) are becoming ubiquitous in both embedded and general purpose computing. NPUs perform several tera operations per second in a confined area. Therefore, they become subject to elevated on-chip power densities that rapidly result in excessive on-chip temperatures during operation [15]. These design efforts focused on power-hungry parallel computing techniques, yet unsustainable for resource-constrained devices. As a result, radical changes to conventional computing approaches are required in order to sustain and improve performance while satisfying mandatory energy and temperature constraints [16].

In the state-of-the-art research, we find plenty of hardware architectures for CNN accelerators implemented in FPGA. Most of the research implements fixed-point quantization, and very limited research focuses on FP. Moreover, to the best of our knowledge, there is no research work exploring FP inference for low-power embedded systems.

### **1.3. Research Objective**

The main objective for this doctoral research is investigating hardware design methodologies for low-power FP neural network accelerators based on approximate computing in the scope of embedded systems.

### **1.4. Working Hypothesis**

To overcome the problem, based on the error resilience of ML algorithms, an evident solution is approximate computing. This computing paradigm has been used in a wide range of applications to increase the hardware computational efficiency [17]. For neural network applications, two main approximation strategies are used, namely network compression and classical approximate computing [18].

#### **1.4.1. Network Compression and Quantization**

Researchers focusing on embedded applications started lowering the precision of weights and activation maps to shrink the memory footprint of the large number of parameters representing ANNs, a method known as network quantization. In this manner, reduced bit precision causes a small accuracy loss [19, 20, 21, 22]. In addition to quantization, network pruning reduces the model size by removing structural portions of the parameters and its associated computations [23, 24]. This method has been identified as an effective technique to improve the efficiency of CNN for applications with limited computational budget [25, 26, 27]. These techniques leverage the intrinsic error-tolerance of neural networks, as well as their ability to recover from accuracy degradation while training.

#### **1.4.2. Approximate Computing**

Approximate computing is an emerging design paradigm that is able to tradeoff computation quality (e.g., accuracy) and computational efficiency (e.g., in run-time, chip-area, and/or energy) by exploiting the error resilience properties of algorithms/applications [16, 28]. Data redundancy of neural networks incorporate a certain degree of resilience against random external and internal perturbations, for instance, noisy inputs and random hardware errors. This property can be exploited in a cross-layer resilience approach [29]: by leveraging error tolerance at algorithmic-level, it can be allowed a certain degree of inaccuracies at the computing-level. This approach

consists of designing processing elements that approximate their computation by employing cleverly modified algorithmic logic units [17].

Approximate computing techniques allow substantial enhancement in processing efficiency with moderated accuracy degradation. Some research papers have shown the feasibility of applying approximate computing to the inference stage of neural networks [30, 17, 31, 32, 33, 34]. Such techniques usually demonstrated small inference accuracy degradation, but significant enhancement in computational performance, chip-area, and energy consumption. Hence, by taking advantage of the intrinsic error-tolerance of neural networks, approximate computing is positioned as a promising approach for inference on resource-limited devices. Nonetheless, the complex state-of-the-art of FP CNN inference has not been sufficiently explored with approximate computing techniques.

## 1.5. Motivation

The use of AI/ML is entering a new era with ubiquitous embedded connected devices. This transformation requires design techniques that reconcile accurate results with cost-effective system architectures. However, state-of-the-art AI/ML algorithms represent high computational and energy costs. This compromises the sustainability of the progressive expansion towards massive ubiquitous AI devices. Therefore, hardware efficiency is one of the major goals to innovate compute engines. This section presents the motivations to investigate design methodologies for low-power hardware acceleration for SbS and CNNs.

### 1.5.1. Spike-by-Spike Neural Networks

SNNs offer advantageous robustness and the potential to achieve a power efficiency closer to that of the human brain. SNNs operate reliably using stochastic elements that are inherently non-reliable mechanisms [35]. This provides superior resistance against adversary attacks [36, 37]. Beside robustness, SNNs have further advantages like the possibility of a more efficient asynchronous parallelization and higher energy efficiency than conventional ANNs.

The Spike-by-Spike neural network is a remarkable model for its reduced complexity, which is on the less realistic side of the SNN scale of biological realism [38, 36]. Consequently, the hardware complexity of SbS network implementations is greatly reduced [39]. In spite of this, SbS still uses stochastic spikes as a means of transmitting information between populations of neurons and thus retains the advantageous robustness of SNNs. A significant research effort has been performed in SNN accelerators, see e.g. [40, 18, 41, 42, 43, 44].

## 1. Introduction

---

However, hardware accelerators that focus on SbS have only been partially investigated so far [39]. Enhanced SbS accelerators will have a double impact. From application point of view, they will contribute to the deployment of robust neural networks in small embedded systems [45]; from a scientific point of view, they will facilitate fundamental research for neuroscience [36, 46, 47].

### 1.5.2. Convolutional Neural Networks

CNNs represent the essential building blocks in 2D pattern analytics. Sensor-based applications such as mechanical fault diagnosis [48, 49], structural health monitoring [50], human activity recognition (HAR) [51], hazardous gas detection[52] have been powered by CNN models in industry and academia. CNN models provide advantages such as local dependency, scale invariance, and noise resilience in analytics[31]. However, these models are computationally intensive and power-hungry. This is particularly challenging for low-power embedded applications in the field of IoT. As a result, numerous commercial ASIC and FPGA accelerators have been proposed, targeting both high performance computing (HPC) for data-centers and embedded systems applications.

However, most accelerators have been implemented to target mid- to high-range FPGAs for computationally intensive CNN models such as AlexNet, VGG-16, and ResNet-18. The main drawbacks of these implementations are power supply demands, physical dimensions, heat sink requirements, air cooling, and a resulting high price. In some cases, these implementations are not feasible for ubiquitous low-power/resource-constrained applications. Furthermore, reducing the compute hardware with aggressive quantization such as binary [19], ternary [53], and mixed precision (2-bit activations and ternary weights) [54] typically incur significant accuracy degradation for very low precisions, especially for complex problems[55].

## 1.6. Main Contribution

This thesis contributes to hardware design methodologies for FP neural network accelerators based on approximate computing in the scope of low-power embedded systems. The contributions for SbS and CNN hardware accelerators are listed below.

### 1.6.1. Spike-by-Spike Neural Networks

1. We develop a hardware component for FP vector dot-product approximation. This design increase the performance of computation by performing element-wise multiplication with

a quality configurable design based on bit truncation and denormalized accumulation.

2. We address a design exploration with the proposed dot-product approximation using synaptic weight vectors with custom FP and logarithmic representation. We evaluate inference latency, accuracy degradation, resource utilization and power dissipation. Experimental results demonstrate  $20.5\times$  latency enhancement versus embedded CPU (ARM Cortex-A9 at 666MHz), and less than 0.5% of accuracy degradation on a handwritten digit recognition task (MNIST).
3. We propose a noise tolerance plot as quality monitor, which serves as an intuitive visual model to provide insights into the accuracy degradation of SbS networks under approximate processing effects.
4. Our proposed design for FP dot-product approximation is adaptable as a building block for other error resilient applications (e.g., image/video processing).

### 1.6.2. Convolutional Neural Networks

1. We present the HF6 quantization and its dedicated hardware design. We propose an optimized hardware MAC by reducing the mantissa multiplication to a multiplexer-adder operation. We exploit the intrinsic error tolerance of ANN to further reduce the hardware design with approximation. To preserve model accuracy, we present a quantization-aware training method, which in some cases improves accuracy.
2. We develop a custom hardware/software co-design framework for sensor analytics applications on low-power embedded FPGAs. This architecture integrates TensorFlow Lite.
3. We present a customizable tensor processor as a dedicated hardware for HF6. This design computes *Conv2D* tensor operations employing a pipelined vector dot-product with parametrized on-chip memory utilization.
4. We demonstrate the potential of our approach with a CNN-regression model for anomaly localization in structural health monitoring based on acoustic emissions. We address a hardware design exploration evaluating accuracy, compute performance, hardware resource utilization, and energy consumption.

## 1.7. Publications

The outcome of the research of this dissertation, including the collaborative works with our research partners is a list of publications including [45, 56, 57]. In the following, a complete list of the related publications are itemized.

### Journal Articles

1. **Yarib Nevarez**, David Rotermund, Klaus R Pawelzik, and Alberto Garcia-Ortiz, "Accelerating Spike-by-Spike Neural Networks on FPGA With Hybrid Custom Floating-Point and Logarithmic Dot-Product Approximation," *IEEE Access*, vol. 9, pp. 80603–80620, May 2021, doi: 10.1109/ACCESS.2021.3085216.
2. **Yarib Nevarez**, David Rotermund, Klaus R Pawelzik, and Alberto Garcia-Ortiz, "CNN Sensor Analytics with Hybrid-Float6 on Low-Power Resource-Constrained Embedded FPGAs," *IEEE Access*, vol. 9, pp. 80603–80620, May 2021, doi: 10.1109/ACCESS.2021.3085216.

### Conference Proceedings

3. **Yarib Nevarez**, Alberto Garcia-Ortiz, David Rotermund, and Klaus R Pawelzik, "Accelerator framework of spike-by-spike neural networks for inference and incremental learning in embedded systems," 2020 9th International Conference on Modern Circuits and Systems Technologies (MOCAST), Bremen, 2020, pp. 1–5, doi: 10.1109/MOCAST49295.2020.9200288.
4. Wanli Yu, Ardalan Najafi, **Yarib Nevarez**, Yanqiu Huang and Alberto Garcia-Ortiz, "TAAC: Task Allocation Meets Approximate Computing for Internet of Things," 2020 IEEE International Symposium on Circuits and Systems (ISCAS), Sevilla, 2020, pp. 1-5, doi: 10.1109/ISCAS45731.2020.9180895.
5. Amir Najafi, Ardalan Najafi, **Yarib Nevarez** and Alberto Garcia-Ortiz, "Learning-Based On-Chip Parallel Interconnect Delay Estimation," 2022 11th International Conference on Modern Circuits and Systems Technologies (MOCAST), Bremen, 2022, pp. 1–5, doi: 10.1109/MOCAST49295.2020.9200288.

## 1.8. Dissertation Outline

This dissertation is organized in three main parts: an introduction, where the state of the art and related background are stated; a central core, where the proposed design methodologies and validation are presented; and a final part with the conclusion. More precisely:

**I Introduction:** Chapter 2 introduces the background related to SbS, CNN, and FP number representation.

**II Core:** the proposed hardware design methodologies for SbS and CNN accelerators are presented in Chapter 3 and Chapter 4, respectively.

**III Conclusions:** the final conclusions are presented in Chapter 5.



---

## 2. Background

---

|                                                  |           |
|--------------------------------------------------|-----------|
| <b>2.1. Spike-by-Spike Neural Networks</b>       | <b>11</b> |
| 2.1.1. Basic Network Overview                    | 12        |
| 2.1.2. Computational Cost                        | 12        |
| 2.1.3. Error Tolerance                           | 14        |
| <b>2.2. Conv2D Tensor Operation</b>              | <b>15</b> |
| <b>2.3. Floating-point Number Representation</b> | <b>15</b> |

---

### 2.1. Spike-by-Spike Neural Networks

Technically, SbS is a spiking neural network approach based on a generative probabilistic model. It iteratively finds an estimate of its input probability distribution  $p(s)$  (i.e. the probability of input node  $s$  to stochastically send a spike) by its latent variables via  $r(s) = \sum_i h(i)W(s|i)$ , where  $\vec{h}$  is an inference population composed of a group of neurons that compete with each other. An inference population (IP) sees only the spikes  $s_t$  (i.e. the index identifying the input neuron  $s$  which generated that spike at time  $t$ ) produced by its input neurons, not the underlying input probability distribution  $p(s)$  itself. By counting the spikes arriving at a group of SbS neurons,  $p(s)$  is estimated by  $\hat{p}(s) = 1/T \sum_t \delta_{s,s^t}$  after  $T$  spikes have been observed in total. The goal is to generate an internal representation  $r(s)$  from the string of incoming spikes  $s_t$  such that the negative logarithm of the likelihood  $L = C - \sum_\mu \sum_s \hat{p}_\mu(s) \log(r_\mu(s))$  is minimized.  $C$  is a constant which is independent of the internal representation  $r_\mu(s)$  and  $\mu$  denotes one input pattern from an ensemble of input patterns. Applying a multiplicative gradient descent method on  $L$ , an algorithm for iteratively updating  $h_\mu(i)$  with every observed input spike  $s_t$  could be derived [36]

$$h_\mu^{new}(i) = \frac{1}{1 + \epsilon} \left( h_\mu(i) + \epsilon \frac{h_\mu(i)W(s_t|i)}{\sum_j h_\mu(j)W(s_t|j)} \right) \quad (2.1)$$

## 2. Background

---

where  $\epsilon$  is a parameter that also controls the strength of sparseness of the distribution of latent variables  $h_\mu(i)$ . Furthermore,  $L$  can also be used to derive online and batch learning rules for optimizing the weights  $W(s|i)$ . The interested reader is referred to [36] for a more detailed exposition.

From a practical point of view, SbS provides a mechanism to obtain a sparse representation of input patterns. Given a set of training samples  $\{x_\eta\}$ , it learns weights ( $W$ ), that allow to express the input patterns as a linear sparse non-negative combination of features. During inference, it provides a mechanism for expressing each test input  $x_\mu$  as  $x_\mu \approx W h_\mu$  where all entries are non-negative.

The inference procedure consists in generating indices  $s_t$  distributed according to a categorical distribution of the input pattern  $s_t \sim \text{Categorical}(x_\mu(0), x_\mu(1), \dots, x_\mu(N - 1))$ . Starting with a random  $h$  and executing iteratively **Eq.** (2.1) the SbS algorithms finds  $h_\mu$ . The fundamental concept of SbS can be extended from vector to matrix inputs. In this case, the linear operation  $W h_\mu$  can be replaced by a convolution to obtain a convolutional SbS layer. A detailed description of the SbS algorithm is presented in the Appendix A

### 2.1.1. Basic Network Overview

SbS network models can be constructed in sequential layered structures [38]. Each layer consists of many IPs (represented by  $\vec{h}$ ), while the communication between them is organized by a low bandwidth signal – the spikes.

The SbS layer update is summarized in Algorithm 1. This is an iterative algorithm, where we denote the number of spikes ( $N_{Spk}$ ) as the number of iterations. As a generative model, each iteration updates the internal representation ( $H$ ) based on the input spikes ( $S_t^{in}$ ). A basic SbS network architecture for handwritten digit classification (MNIST) is shown in **Fig.** 2.1 and **Tab.** 2.1. Each IP is an independent computational entity, this allows to design specialized hardware architectures that can be massively parallelized (see **Fig.** 2.2).

### 2.1.2. Computational Cost

The number of MAC operations required for inference of an SbS layer is defined by  $NOPS_{MAC} = N_{Spk} N_X N_Y K_X K_Y (3N_H + 2)$ , where  $N_{Spk}$  is the number of spikes (iterations),  $N_X N_Y$  is the size of the layer,  $K_X K_Y$  is the size of the kernel for convolution/pooling, and  $N_H$  is the length of  $\vec{h}$ . The computational cost of SbS network models is higher compared to equivalent CNN models and lower compared to regular SNN models (e.g., LIF) [58].

---

**Algorithm 1:** SbS layer update.

---

```

1: for  $t \leftarrow 0$  to  $N_{Spk} - 1$  do
2:   for  $x \leftarrow 0, y \leftarrow 0$  to  $N_X - 1, N_Y - 1$  do
3:      $S_t^{out}(x, y) \sim \text{Categorical}(H(x, y, :))$ 
4:     for  $\Delta_X \leftarrow 0, \Delta_Y \leftarrow 0$  to  $K_X - 1, K_Y - 1$  do
5:        $spk \leftarrow S_t^{in}(x + \Delta_X, y + \Delta_Y)$ 
6:       for  $i \leftarrow 0$  to  $N_H - 1$  do
7:          $\Delta h(i) \leftarrow H(x, y, i) \cdot W(\Delta_X, \Delta_Y, spk, i)$ 
8:          $r \leftarrow r + \Delta h(i)$ 
9:       end for
10:      for  $i \leftarrow 0$  to  $N_H - 1$  do
11:         $H^{new}(x, y, i) \leftarrow \frac{1}{1+\epsilon} (H(x, y, i) + \frac{\epsilon}{r} \Delta h(i))$ 
12:      end for
13:    end for
14:  end for
15: end for

```

---



Figure 2.1.: SbS network architecture for handwritten digit classification task.



Figure 2.2.: SbS IPs as independent computational entities, (a) illustrates an input layer with a massive amount of IPs operating as independent computational entities, (b) shows a hidden layer with an arbitrary amount of IPs as independent computational entities, (c) exhibits a set of neurons grouped in an IP.

## 2. Background

---

Table 2.1.: SbS network architecture for handwritten digit classification task.

| Layer ( $H^l$ )           | Layer size |       |       | Kernel size |       |
|---------------------------|------------|-------|-------|-------------|-------|
|                           | $N_X$      | $N_Y$ | $N_H$ | $K_X$       | $K_Y$ |
| Input ( $H^X$ )           | 28         | 28    | 2     | -           | -     |
| Convolution ( $H^1$ )     | 24         | 24    | 32    | 5           | 5     |
| Pooling ( $H^2$ )         | 12         | 12    | 32    | 2           | 2     |
| Convolution ( $H^3$ )     | 8          | 8     | 64    | 5           | 5     |
| Pooling ( $H^4$ )         | 4          | 4     | 64    | 2           | 2     |
| Fully connected ( $H^5$ ) | 1          | 1     | 1024  | 4           | 4     |
| Output ( $H^Y$ )          | 1          | 1     | 10    | 1           | 1     |

### 2.1.3. Error Tolerance

To illustrate the error tolerance/resilience of SbS networks, we present a classification performance under positive additive uniformly distributed noise as external disturbance. **Fig. 2.3** presents a comparison of the classification performance of an SbS network and a standard CNN, with the same amount of neurons per layer as well as the same layer structure. We trained both neural networks for handwritten digit classification on MNIST dataset [59] (see [38] for details). The figure shows the correctness for the MNIST test set with its 10,000 patterns in dependency of the noise level for positive additive uniformly distributed noise. The blue curve shows the performance for the CNN, while the red curve shows the performance for the SbS network with 1200 spikes (iterations). Beginning with a noise level of 0.1, the respective performances are different with a p - level of at least  $10^{-6}$  (tested with the Fisher exact test). Increasing the number of spikes per SbS population to 6000 (performance values shown as black stars), shows that more spikes can improve the performance under noise even more.



Figure 2.3.: (a) Performance classification of SbS NN versus equivalent CNN, and (b) example of the first pattern in the MNIST test data set with different amounts of positive additive uniformly distributed noise.

## 2.2. Conv2D Tensor Operation

A convolutional layer aims to learn feature representations from an input layer. The convolution layer is made of convolution kernels that are used to compute feature maps. Each unit of a feature map is connected to a region of neighboring units on the input maps (from previous layer). Such a neighborhood of the previous layer is known as the receptive field of the unit. A new feature map can be obtained by first convolving the input maps with a learned kernel and then applying a nonlinear elementwise activation function to the convolved results. All spatial locations on the input maps share a kernel to generate a feature map. All feature maps are obtained by convolving several different kernels [60].

The 2D convolution process is performed by the *Conv2D* tensor operation, described in **Eq. (2.2)**, where  $h$  is the input tensor containing the feature maps,  $W$  is the convolution kernels (known as filters), and  $b$  is the bias vector for the output feature maps [61].  $K \times L \times M$  is the receptive field size,  $K \times L$  is the convolution kernel, and  $M$  is the number of input channels/feature maps. We denote *Conv* as *Conv2D* operator.

$$\text{Conv}(W, h)_{i,j,o} = \sum_{k,l,m}^{K,L,M} h_{(i+k,j+l,m)} W_{(o,k,l,m)} + b_o \quad (2.2)$$

## 2.3. Floating-point Number Representation

The representation of every numerical value, in any number system, is made of an integer and a fractional part. The border that delimits them is called the radix point. The fixed-point format for representing numeric values derives its name from the fact that in this format, the base point is fixed at a certain position. For integer numbers, this position is at the right of the least significant digit.

In scientific computation, it is often necessary to represent very large and very small values. This is difficult to achieve using the fixed-point format because the bit size/width required to maintain both the desired precision and the desired range are very large. In such situations, FP formats are used to represent real numbers. Each FP number can be divided into three fields: sign  $S$ , exponent  $E$ , and mantissa  $M$ . Using the binary number system, it is possible to represent any FP number as:

$$(-1)^S \times 1.M \times 2^{E-B} \quad (2.3)$$

## 2. Background

---

In FP representations the exponent is biased. This bias depends on the bit size of the exponent field in the particular format. This exponent bias is defined by **Eq.** (2.4), where  $E_{size}$  is the exponent bit size.

$$B = 2^{E_{size}-1} - 1 \quad (2.4)$$

There is a natural trade-off between small bit size requiring fewer hardware resources and larger bit size providing higher precision. Within a given total bit size, it is possible to assign various combinations of sizes to the exponent and mantissa fields, with wider exponents resulting in a higher range and wider mantissa resulting in better precision.

The most widely used format for FP arithmetic is the IEEE 754 standard [62]. The IEEE single-precision format (32-bit) is expressed by **Eq.** (2.3) with  $B = 127$ , 8 bits for the exponent and 23 bits for the mantissa, see **Fig. 2.4(a)**. In FP formats, the numbers are normalized, the leading one is an implicit bit, and only the fractional part is explicitly stored in the mantissa field.

Reduced bit size than those specified in the IEEE 754 standard are often sufficient to provide the desired precision. Reduced designs require fewer hardware resources enabling low-power implementations. In custom hardware designs, it is possible to customize the FP format implemented. In later sections, we use the term  $EaMb$  to denote FP formats, where  $a$  and  $b$  are the exponent and mantissa bit size, respectively. For example, E4M1 means 4-bit exponent and 1-bit mantissa, see **Fig. 2.4(d)**.

There are three special definitions in IEEE 754 standard. The first is subnormal numbers when  $E = 0$ , then **Eq.** (2.3) is modified to **Eq.** (2.5). Infinity and not a number (NaN) are the other two special cases but are not used in our work.



Figure 2.4.: Floating-point number representation.

$$(-1)^S \times 0.M \times 2^{1-B} \quad (2.5)$$



---

# **3. Accelerating Spike-by-Spike Neural Networks**

---

|                                                                                           |           |
|-------------------------------------------------------------------------------------------|-----------|
| <b>3.1. Introduction</b>                                                                  | <b>19</b> |
| <b>3.2. Related Work</b>                                                                  | <b>23</b> |
| 3.2.1. Network Compression                                                                | 23        |
| 3.2.2. Classical Approximate Computing                                                    | 24        |
| 3.2.3. Spike-by-Spike Neural Networks Accelerators                                        | 25        |
| <b>3.3. System Design</b>                                                                 | <b>25</b> |
| 3.3.1. Hardware Architecture                                                              | 25        |
| 3.3.2. Conv Processing Unit                                                               | 27        |
| 3.3.3. Dot-Product Hardware Module                                                        | 29        |
| <b>3.4. Experimental Results</b>                                                          | <b>33</b> |
| 3.4.1. Performance Benchmark                                                              | 34        |
| 3.4.2. Design Exploration with Hybrid Custom Floating-Point and Logarithmic Approximation | 39        |
| 3.4.3. Results and Discussion                                                             | 44        |
| <b>3.5. Conclusions</b>                                                                   | <b>46</b> |

---

## **3.1. Introduction**

The exponential improvement in computing performance and the availability of large amounts of data are boosting the use of artificial intelligence (AI) applications in our daily lives. Among the various algorithms developed over the years, neural networks (NNs) have demonstrated remarkable performance in a variety of image, video, audio, and text analytics tasks [63, 64].

### 3. Accelerating Spike-by-Spike Neural Networks

---

Historically, artificial neural networks (ANNs) can be classified into three different generations [65]: the first one is represented by the classical McCulloch and Pitts neuron model using discrete binary values as outputs; the second one is represented by more complex architectures as multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs) using continuous activation functions; while the third generation is represented by spiking neural networks using spikes as means for information exchange between groups of neurons. Although the AI research is currently dominated by deep neural networks (DNNs) from the second generation, the SNNs belonging to the third generation are receiving considerable attention [43, 36, 65, 66].

SNNs offer advantageous robustness and the potential to achieve a power efficiency closer to that of the human brain. SNNs operate reliably using stochastic elements that are inherently non-reliable mechanisms [35]. This provides superior resistance against adversary attacks [36, 37]. Beside robustness, SNNs have further advantages like the possibility of a more efficient asynchronous parallelization and higher energy efficiency than DNNs. For example, Loihi [44], a SNN developed by Intel, can solve LASSO optimization problems with an over three orders of magnitude better energy-delay product than conventional approaches. These advantages are motivating large research programs by major companies (e.g., Intel [44] and IBM [42]) as well as pan-european projects in the domain of spiking networks [43].

SNNs emulate the real behavior of neurons in different levels of detail. The more detailed the biological part is emulated, the greater the computational complexity [58, 67]. For example, Leaky Integrate-and-Fire (LIF) is a widely used model; however, this model is relatively complex for emulation in practical embedded applications with hardware and energy constraints.

Alternatively, the Spike-by-Spike (SbS) neural network is a remarkable model for its reduced complexity, which is on the less realistic side of the SNN scale of biological realism[38, 36]. Consequently, the hardware complexity of SbS network implementations is greatly reduced [45, 39]. In spite of this, SbS still uses stochastic spikes as a means of transmitting information between populations of neurons and thus retains the advantageous robustness of SNNs.

The conceptual model in SbS (see Sec. 2.1 for a short review) differs fundamentally from conventional ANNs since (a) the building blocks of the network are inference populations (IP) which are an optimized generative representation with non-negative values, (b) time progresses from one spike to the next, preserving the property of stochastically firing neurons, and (c) a network has only a small number of parameters, which is an advantageous noise-robust stochastic version of Non-Negative Matrix Factorization (NNMF). The SbS network is placed between non-spiking NN and stochastically spiking NN, which offers advantages from both structures [38]. On one hand, the SbS model incorporates the inherent robustness of SNNs, which provide the possibility of more efficient asynchronous parallelization and superior resilience against

disturbances based on the synaptic stochasticity; on the other hand, the SbS model incorporates the regular flow of information from CNNs, which are expressed on explicit vector operations.

As computational demanding algorithms, ANNs and SNNs in particular, must be addressed by specialized hardware architectures. A significant research effort has been performed in SNN accelerators, see e.g. [40, 18, 41, 42, 43, 44]. However, hardware accelerators that focus on SbS have only been partially investigated so far [45, 39]. Enhanced SbS accelerators will have a double impact. From an engineering point of view, they will contribute to the deployment of robust neural networks in small embedded systems[45]; from a scientific point of view, they will facilitate fundamental research for neuroscience [36, 46, 47].

A central point that can be optimized in current SbS accelerators is the use of approximation techniques. Most SbS models use floating-point numerical representation, which imposes high complexity of the required circuits for the floating-point operations. Model quantization has the potential to improve computational performance; however, this solution is often accompanied by quantization-aware training methods that, in some cases, are problematic or even inaccessible, particularly in deep SNN algorithms[68]. As an alternative, based on the relaxed need for fully precise or deterministic computation of neural networks, approximate computing techniques allow substantial enhancement in processing efficiency with moderated accuracy degradation. Some research papers have shown the feasibility of applying approximate computing to the inference stage of neural networks [30, 33, 32, 31]. Such techniques usually demonstrated small inference accuracy degradation, but significant enhancement in computational performance, resource utilization, and energy consumption. Hence, by taking advantage of the intrinsic error-tolerance of neural networks, approximate computing is positioned as a promising approach for inference on resource-limited devices.

In this paper, we accelerate SbS neural networks with a dot-product hardware design based on approximate computing with hybrid custom floating-point and logarithmic number representation. This hardware unit has a quality configurable scheme based on the bit truncation of the synaptic-weight vector. **Fig. 3.1** illustrates the dot-product hardware module with standard floating-point (IEEE 754) arithmetic, and our approach with hybrid custom floating-point as well as logarithmic approximation. As a design parameter, the mantissa bit-width of the weight vector provides a tunable knob to trade-off between efficiency and quality of result (QoR)[69, 17]. Since the lower-order bits have smaller significance than the higher-order bits, truncating them may have only a minor impact on QoR [70, 71]. Further on, we can remove completely the mantissa bits in order to use only the exponent of a floating-point representation. Therefore, the most efficient setup and yet the worst-case quality configuration becomes a logarithmic representation, which consequently leads to significant architectural-level optimizations using

### 3. Accelerating Spike-by-Spike Neural Networks

---

only adders and shifters for dot-product approximation in hardware. Moreover, since approximations and noise have qualitatively the same effect[72], we apply noise tolerance plots as an intuitive visual measure to provide insights into the quality degradation of SbS networks under approximate processing effects.

Our main contributions are as follows:

- We develop a hardware component for dot-product approximation. To perform the sum of pairwise products of two vectors, this hardware module has the following three design features: (1) the pairwise product is approximated by adding integer exponents and multiplying truncated mantissas, and the sum of products is done by accumulating denormalized integer products with barrel shifters, which increases computational throughput; (2) the synaptic weight vector uses either reduced custom floating-point or logarithmic representation, which reduces memory footprint; and (3) the neuron vector uses either standard or custom floating-point representation, which preserves QoR and overall inference accuracy.
- We address a design exploration with the proposed dot-product approximation using synaptic weight vectors with custom floating-point and logarithmic representation as shown in **Fig. 3.1**. We evaluate inference latency, accuracy degradation, resource utilization and power dissipation. Experimental results demonstrate 20.5 $\times$  latency enhancement versus embedded CPU (ARM Cortex-A9 at 666MHz), and less than 0.5% of accuracy



Figure 3.1.: Dot-product hardware module with (a) standard floating-point (IEEE 754) arithmetic, (b) hybrid custom floating-point approximation, and (c) hybrid logarithmic approximation.

degradation on a handwritten digit recognition task (MNIST). This machine learning task simply provides a proof of concept to demonstrate the feasibility of our approximation technique for SbS neural network accelerators.

- We propose a noise tolerance plot as quality monitor, which serves as an intuitive visual model to provide insights into the accuracy degradation of SbS networks under approximate processing effects.
- Our proposed design for dot-product approximation is adaptable as a building block for other error resilient applications (e.g., image/video processing).

The rest of the paper is organized as follows. Section 4.2 covers the related work; Section ?? introduces the background to SbS networks; Section 4.3 describes the system design and the approximate dot-product hardware module; Section 4.4 presents the experimental results thorough a design exploration flow; Section 4.5 concludes the paper.

To promote the research on SbS networks, our design exploration framework is made available to the public as an open-source project at <https://github.com/YaribNevarez/sbs-framework.git>

## 3.2. Related Work

For efficient neural network computation, two main optimization strategies are used, namely network compression and classical approximate computing[18].

### 3.2.1. Network Compression

Researchers focusing on embedded applications started lowering the precision of weights and activation maps to shrink the memory footprint of the large number of parameters representing ANNs, a method known as network compression or quantization. This practice takes advantage of the intrinsic error-tolerance of neural networks, as well as their ability to compensate for approximation while training. In this way, reduced bit precision causes a small accuracy loss [19, 20, 21, 22].

In hardware development, weight quantization (WQ) has shown up to 2 $\times$  improvement in energy consumption with an accuracy degradation of less than 1% [73, 74]. Some advanced quantization methods yield to binary neural networks (BNNs) allowing the use of XNORs instead of the conventional costly multiply-accumulate circuits (MACs) [22]. In [75], Sun et al. report an accuracy of 98.43% on handwritten digit classification (MNIST) with a simple

### 3. Accelerating Spike-by-Spike Neural Networks

---

BNN. Hence, quantization is a powerful tool for improving the energy efficiency and memory requirements of ANN accelerators, with limited accuracy degradation.

In addition to quantization, network pruning reduces the model size by removing structural portions of the parameters and its associated computations [23, 24]. This method has been identified as an effective technique to improve the efficiency of DNN for applications with limited computational budget[25, 26, 27].

These methods can be used for SNNs as well. In [76], Rathi et al. report up to  $3.1\times$  improvement in energy consumption with an accuracy loss of around 3%. Weight quantization allows the designer to realize a trade-off between the accuracy of the SNN application and efficiency of resources. Approximate computing can also be applied at the neuron level, where irrelevant units are deactivated to reduce the computation cost of the SNNs [77]. This computation skipping can be applied randomly on synapses, training ANNs with stochastic synapses improves generalization, resulting in a better accuracy[78, 79]. Such methods are compatible with SNNs and have been tested both during training [80, 81] and operation [82], and even to define the connectivity between layers [83, 84]. Implementations of spiking neuromorphic systems in FPGA [85] and hardware [86] demonstrated that synaptic stochasticity allows to increase the final accuracy of the networks while reducing memory footprint.

Quantization is therefore a powerful technique to improve energy efficiency and memory requirements of ANN and SNN accelerators, with small accuracy degradation. However, this approach requires quantization-aware training methods that, in some cases, are problematic or even inaccessible, particularly in emerging deep SNN algorithms[68].

#### 3.2.2. Classical Approximate Computing

Approximate computing has been used in a wide range of applications to increase the computational efficiency in hardware[17]. This approach consists of designing processing elements that approximate their computation by employing modified algorithmic logic units [17]. In [87], Kim et al. have shown SNNs using carry skip adders achieving  $2.4\times$  latency enhancement and 43% more energy efficiency, with an accuracy degradation of 0.97% on a handwritten digit classification task (MNIST). Therefore, approximate computing provides important enhancement in energy efficiency and processing speed.

However, as the complexity of the dataset increases, as well as the depth of the network topology, such as ResNet [88] on ImageNet [89], the accuracy degradation becomes more important and may not be negligible anymore [22], especially for critical applications such as autonomous driving. Therefore, it is not certain that network compression techniques and approximate computing are suitable for all applications.

### 3.2.3. Spike-by-Spike Neural Networks Accelerators

Recently, Rotermund et al. demonstrated the feasibility of a neuromorphic SbS IP on a Xilinx Virtex 6 FPGA [39]. It provides a massively parallel architecture, optimized to reduce memory access and suitable for ASIC implementations. Nonetheless, this design is considerably resource-demanding if implemented as a full SbS network in today's embedded technology.

In ??, we presented a cross-platform accelerator framework for design exploration and testing of fully functional SbS network models in embedded systems. As a hardware/software (HW/SW) co-design solution, this framework offers a comprehensive high level embedded software API that allows the construction of scalable sequential SbS networks with configurable hardware acceleration. However, this design works entirely with standard floating-point arithmetic (IEEE 754). This represents a large memory footprint and inadequate computational cost for error resilient applications on resource-limited devices.

In this article, we will use this design exploration framework to investigate approximate computing for efficient deployment of deep SbS networks on resource-limited devices.

## 3.3. System Design

In this section, we present a hardware architecture composed of specialized heterogeneous processing units (PUs) with hybrid custom floating-point and logarithmic dot-product approximation. This approach represents an advantageous design for error resilient applications in resource-constrained devices due to the reduced computational costs and memory footprint. Furthermore, the proposed approach allows the implementation of stationary synaptic weight matrices.

Regarding the software architecture, this is structured as a layered object-oriented application framework written in the C programming language. This offers a comprehensive high level embedded software application programming interface (API) that allows the construction of scalable sequential SbS networks with configurable hardware acceleration. Conceptually this design is modular, reusable, and extensible. The overall structure is depicted in 4.8.

### 3.3.1. Hardware Architecture

As a hardware/software co-design, the system architecture is an embedded CPU+FPGA-based platform, where the acceleration of SbS network computation is based on asynchronous<sup>1</sup> execution of parallel heterogeneous processing units: *Spike* (input layer), *Conv* (convolution), *Pool*

---

<sup>1</sup>The system is synchronous at the circuit level, but the execution is asynchronous in terms of jobs.

### 3. Accelerating Spike-by-Spike Neural Networks

---

(pooling), and *FC* (fully connected). 3.3 illustrates the system overview as a scalable structure. For hyperparameter configuration, each PU uses AXI-Lite interface. For data transfer, each PU uses AXI-Stream interfaces via Direct Memory Access (DMA) allowing data movement with high transfer rate. Each PU asserts an interrupt flag once the job or transaction is complete. This interrupt event is handled by the embedded CPU to collect results and start a new transaction.

The hardware architecture can resize its resource utilization by changing the number of PUs instances prior to the hardware synthesis, this provides scalability with a good trade-off between area and throughput. The dedicated PUs for *Conv* and *FC* implement the proposed dot-product approximation as a system component. The PUs are written in C using Vivado HLS (High-Level Synthesis) tool. In this publication, we illustrate the integration of the approximate dot-product component on the *Conv* processing unit.



Figure 3.2.: System-level overview of the embedded software architecture.



Figure 3.3.: System-level hardware architecture with scalable number of heterogeneous PUs: *Spike*, *Conv*, *Pool*, and *FC*

### 3.3.2. Conv Processing Unit

This hardware module computes the IP dynamics defined by 2.1 and offers two modes of operation: *configuration* and *computation*.

#### Configuration Mode

In this mode of operation, the PU receives and stores in on-chip memory (BRAM) the hyperparameters to compute the IP dynamics:  $\epsilon$  as the epsilon,  $N$  as the length of  $\vec{h}_\mu \in \mathbb{R}^N$ ,  $K \in \mathbb{N}$  as the size of the convolution kernel, and  $H \in \mathbb{N}$  as the number of IPs to process per transaction.  $H$  is the number of IPs forming a layer or a partition.

Additionally, the processing unit also stores in on-chip memory (BRAM) the synaptic weight matrix using a number representation with a reduced memory footprint. Fundamentally, the synaptic weight matrix is defined by  $W \in \mathbb{R}^{K \times K \times M \times N}$  with  $0 \leq W(s_t|j) \leq 1$  and  $\sum_{s_t=0}^{M-1} W(s_t|j) = 1$  [38]. Hence,  $W$  employs only positive normalized real numbers. Therefore,  $W$  is deployed using a reduced floating-point or logarithmic representation as follows:

- Custom floating-point representation. In this case,  $W$  is deployed with a reduced floating-point representation using the user defined bit-width for the exponent and for the mantissa. For example, 4-bit exponent, 1-bit mantissa; as a result: 5-bit custom floating-point. The methodology to determine the required bit-width is described in Section 3.3.3.
- Logarithmic representation. In this case, the synaptic weight matrix is  $W \in \mathbb{N}^{K \times K \times M \times N}$  with positive natural numbers. Since  $0 \leq W(s_t|j) \leq 1$  and  $\sum_{s_t=0}^{M-1} W(s_t|j) = 1$ ,  $W$  has only negative values in the logarithmic domain. Hence, the sign bit is omitted, and the values are represented in its positive form. Therefore,  $W$  is deployed with a representation using the necessary bit-width for the exponent according to the given application. For example, 4-bit exponent. The methodology to determine the required bit-width is described in Section 3.3.3.

In order to deploy different SbS network models, the *Conv* processing units can be configured with different synaptic weight matrices and hyperparameters as required through the embedded software.

#### Computation Mode

In this mode of operation, the PU executes a transaction to process a group of IPs using the previously given hyperparameters and synaptic weight matrix. This process operates in six

### 3. Accelerating Spike-by-Spike Neural Networks

---

stages as shown in 3.4. In the first two stages, the PU receives  $\vec{h}_\mu \in \mathbb{R}^N$ , then the PU calculates the emitted spike, and stores it in  $S^{new} \in \mathbb{N}^H$  (output spike vector). From the third to the fifth stage, the PU receives  $S_t \in \mathbb{N}^{K \times K}$  (input spike matrix), then it computes the update dynamics, and then it dispatches  $\vec{h}_\mu^{new} \in \mathbb{R}^N$  (updated IP). This process repeats for  $H$  number of loops (for each IP of the layer or partition). Finally, the  $S^{new}$  is dispatched.

The computation of the update dynamics (see 3.4(d)) operates in two modular stages: *dot-product* and *neuron update*. First, the *dot-product* module calculates the sum of pairwise products of  $\vec{h}_\mu$  and  $\vec{W}(s_t)$ , each pairwise product is stored as intermediate results. Subsequently, the *neuron update* module calculates 2.1 reusing previous results and parameters.

The calculation of the dot-product of 2.1 represents a considerable computational cost using standard floating-point in non-quantized network models. Fortunately, the pair product of  $h_\mu(j)$  and  $W(s_t|j)$  was defined by us as an approximable factor in the dot-product of 2.1. In the following section, we focus on an optimized dot-product hardware design based on approximate computing.



Figure 3.4.: The *Conv* processing unit and its six stages: (a) receive IP vector, (b) spike firing, (c) receive spike kernel, (d) update dynamics, (e) dispatch new IP vector, (f) dispatch output spike matrix.

### 3.3.3. Dot-Product Hardware Module

This dot-product hardware module is part of an application-specific architecture optimized to approximate the dot-product of arbitrary length, see 3.1. For quality configurability, we parameterized the mantissa bit-width of  $\vec{W}(s_t)$ , which provides a tunable trade-off between resource utilization and QoR. Since the lower-order bits have smaller significance than the higher-order bits, removing them may have only a minor impact on QoR. We designate this as hybrid custom floating-point approximation (see 3.1(b)).

$$r_\mu(s_t) = \sum_{j=0}^{N-1} h_\mu(j) W(s_t|j) \quad (3.1)$$

Further on, we remove the mantissa bits completely in order to use only the exponent of a floating-point representation. Hence, the worst-case quality and yet the most efficient configuration becomes a logarithmic representation. Consequently, this structure leads to advantageous architectural optimizations using only adders and barrel shifters for dot-product approximation in hardware. We designate this as hybrid logarithmic approximation (see 3.1(c)).

In order to determine the required bit-width for the number representation, we use 3.2, 3.3, and 3.4.

$$E_{\min} = \log_2(\min_{\forall i}(W(i))) \quad (3.2)$$

$$N_E = \lceil \log_2(|E_{\min}|) \rceil \quad (3.3)$$

$$N_W = N_E + N_M \quad (3.4)$$

The 3.2 obtains the exponent of the minimum entry value in the synaptic weight matrix. Since  $0 \leq W(s_t|j) \leq 1$  and  $\sum_{s_t=0}^{M-1} W(s_t|j) = 1$ ,  $W$  has only negative values in the logarithmic domain; hence, by searching for the smallest value, we obtain the biggest negative exponent ( $E_{\min}$ ). Then, the 3.3 obtains the necessary bit-width to represent the exponent ( $N_E$ ). Finally, we obtain the total bit-width by incorporating both exponent and mantissa bit-widths in 3.4.  $N_M$  denotes the

### 3. Accelerating Spike-by-Spike Neural Networks

---

mantissa bit-width, this is a knob parameter that is tuned by the designer to trade-off between resource utilization and QoR. The bit-width concept is illustrated in 3.1.

In this section, we will present three pipelined hardware modules with standard floating-point (IEEE 754) computation, hybrid custom floating-point approximation, and hybrid logarithmic approximation.

#### **Dot-Product with Standard Floating-Point Computation**

The hardware module to calculate the dot-product with standard floating-point computation is shown in 3.5. This diagram presents the hardware blocks and their clock cycle schedule. This module loads both  $h_\mu(j)$  and  $W(s|j)$  from BRAM, then the PU executes the pairwise product (3.5(c)) and accumulation (3.5(d)). The intermediate results of  $h_\mu(j)W(s_t|j)$  are stored in BRAM for reuse in the neuron update. The latency in clock cycles of this hardware module is defined by 3.5, where  $N$  is the dot-product length. This latency equation is obtained from the general pipelined hardware latency formula:  $L = (N - 1)II + IL$ , where  $II$  is the initiation interval (3.5(a)), and  $IL$  is the iteration latency (3.5(b)). Both  $II$  and  $IL$  are obtained from the high-level synthesis analysis. The equation for the latency with standard 32-bit floating-point is:

$$L_{f32} = 10N + 9 \quad (3.5)$$

In this design, the high-level synthesis tool infers computational blocks with considerable latency cost for standard floating-point. In the case of floating-point multiplication (3.5(c)), the synthesis infers a hardware block with a latency cost of 5 clock cycles. Theoretically, this block would handle exponents addition, mantissas multiplication, and mantissa correction if needed. Moreover, in the case of floating-point addition (3.5(d)), the synthesis infers a hardware block with a latency cost of 9 clock cycles. Seemingly, this block would handle mantissas alignment, addition, and correction if needed. Therefore, the use of standard floating-point in high-level synthesis results in high computational cost, which represents unnecessary overhead in error tolerant applications.

#### **Dot-Product with Hybrid Custom Floating-Point and Logarithmic Approximation**

The hardware module to calculate dot-product with hybrid custom floating-point approximation is shown in 3.6. In this design,  $h_\mu(j)$  uses standard 32-bit floating-point number representation, and  $W(s|j)$  uses a positive reduced custom floating-point number representation, where the mantissa bit width is the quality configurability knob. This parameter is tuned by the designer to trade-off between QoR and resource utilization, thus, energy consumption.

As the most efficient setup and yet the worst-case quality configuration, by completely truncating the mantissa of  $W(s|j)$  leads to a slightly different hardware architecture using only adders and shifters, which computes the dot-product with hybrid logarithmic approximation. This is shown in 3.7.

Additionally, the exponent bit-width of  $W(s|j)$  is a design parameter for efficient resource utilization and it is defined based on the application or deployment needs.

The hybrid custom floating-point and logarithmic approximation designs work in three phases: *Computation*, *Threshold-test*, and *Result normalization*.

- Phase I, *Computation*:

This phase approximates the magnitude of the dot-product in a denormalized representation. This is calculated in two iterative steps over each vector element: *pairwise product* and *accumulation*, where *pairwise product* is executed either in hybrid custom floating-point or hybrid logarithmic approximation described below.

- Pairwise product.

- Hybrid custom floating-point approximation. As shown in 3.6(c) in dark-gray, the pairwise product is approximated by adding exponents and multiplying



Figure 3.5.: Dot-product hardware module with standard floating-point (IEEE 754) computation, (a) exhibits the initiation interval of 10 clock cycles, (b) presents the iteration latency of 19 clock cycles, (c) shows the pairwise product block in dark-gray, and (d) illustrates the accumulation block in light-gray.

mantissas of both  $W(s|i)$  and  $h_\mu(i)$ . If the mantissa multiplication results in an overflow, then it is corrected by increasing the exponent and shifting the resulting mantissa by one position to the right. Then we get  $h_\mu(j)W(s_t|j)$  as an intermediate result which is stored for future reuse in the neuron update calculation. In this design the pairwise product has a latency of 5 clock cycles.

- Hybrid logarithmic approximation. As shown in 3.7(c) in dark-gray, the pairwise product is approximated by adding  $W(s|i)$  to the exponent of  $h_\mu(i)$ , since  $W(s|j)$  values are represented in the logarithmic domain and  $h_\mu(j)$  in standard floating-point. In this design the pairwise product has a latency of one clock cycle.
- Accumulation. As shown in both 3.6(d) and 3.7(d) in light-gray, first, it is obtained the denormalized representation of  $h_\mu(j)W(s_t|j)$  by shifting its mantissa using its exponent as shifting parameter (barrel shifter). Then, this denormalized representation is accumulated to obtain the approximated magnitude of the dot-product.

The process of pairwise product and accumulation iterates over each element of the vectors. The computation latency is given by 3.6 for hybrid custom floating-point, and 3.7 for hybrid logarithmic, where  $N$  is the length of the vectors. Both pipelined hardware modules have the same throughput, since both have two clock cycles as initiation interval.

$$L_{custom} = 2N + 11 \quad (3.6)$$

$$L_{log} = 2N + 7 \quad (3.7)$$

- Phase II, *Threshold-test*:

The accumulated denormalized magnitude is tested to be above of a predefined threshold, it must be above zero, since the dot-product is the denominator in 2.1. If passing the threshold, then the next phase is executed. Otherwise the rest of update dynamics is skipped. The threshold-test takes one clock cycle.

- Phase III, *Result-normalization*:

In this phase, the dot-product is normalized to obtain the exponent and mantissa in order to convert it to standard floating-point for later use in the neuron update. The normalization is obtained by shifting the approximated dot-product magnitude in a loop until it is in the form of a normalized mantissa where the iteration count represents the exponent of the dot-product. Each iteration takes one clock cycle.

The total latency of the hardware module with hybrid custom floating-point and hybrid logarithmic approximation is the accumulated latency of the three phases.

The proposed architectures with approximation approach exceeds the performance of the design with standard floating-point. This performance enhancement is achieved by decomposing the floating-point computation into an advantageous handling of exponent and mantissa using intermediate accumulation in a denormalized representation and only one final normalization.



Figure 3.6.: Dot-product hardware module with hybrid custom floating-point approximation, (a) exhibits the initiation interval of 2 clock cycles, (b) presents the iteration latency of 13 clock cycles, (c) shows the pairwise product blocks in dark-gray, and (d) illustrates the accumulation blocks in light-gray.

## 3.4. Experimental Results

The proposed architecture is demonstrated on a Xilinx Zynq-7020. This device integrates a dual ARM Cortex-A9 based processing system (PS) and programmable logic (PL) equivalent to Xilinx Artix-7 (FPGA) in a single chip [90]. The Zynq-7020 architecture conveniently maps the custom logic and software in the PL and PS respectively as an embedded system.

In this platform, we implement the proposed hardware architecture to deploy the SbS network structure shown in 2.1 for handwritten digit classification task using MNIST data set. The SbS model is trained in Matlab without any quantization method, using standard floating-point. The resulting synaptic weight matrices are deployed on the embedded system. There, the SbS

### 3. Accelerating Spike-by-Spike Neural Networks



Figure 3.7.: Dot-product hardware module with hybrid logarithmic approximation, (a) exhibits the initiation interval of 2 clock cycles, (b) presents the iteration latency of 9 clock cycles, (c) shows the pairwise product block in dark-gray, and (d) illustrates the accumulation blocks in light-gray.

network is built as a sequential model using the API from the SbS embedded software framework [45]. This API allows to configure the computational workload of the neural network, which can be distributed among the hardware processing units and the embedded CPU.

For the evaluation of our approach, we address a design exploration by reviewing the computational latency, inference accuracy, resource utilization, and power dissipation. First, we benchmark the performance of SbS network simulation on the embedded CPU, and then repeat the measurements on hardware processing units with standard floating-point computation. Afterwards, we evaluate our dot-product architecture, addressing a design exploration with hybrid custom floating-point approximation, as well as the hybrid logarithmic approximation. Finally, we present a discussion of the presented results.

#### 3.4.1. Performance Benchmark

##### Benchmark on Embedded CPU

We examine the performance of the CPU for SbS network simulation with no hardware co-processing. In this case, the embedded software builds the SbS network as a sequential model mapping the entire computation to the CPU (ARM Cortex-A9) at 666 MHz and a power dissipation of 1.658W.

The SbS network computation on the CPU achieves a latency of 34.28ms per spike with an

Table 3.1.: Computation on embedded CPU.

| Layer   | Latency (ms) |
|---------|--------------|
| HX_IN   | 1.184        |
| H1_CONV | 4.865        |
| H2_POOL | 3.656        |
| H3_CONV | 20.643       |
| H4_POOL | 0.828        |
| H5_FC   | 3.099        |
| HY_OUT  | 0.004        |
| TOTAL   | 34.279       |

accuracy of 99.3% correct classification on the 10,000 image test set with 1000 spikes. The latency and schedule of the SbS network computation are displayed in 3.1 and 3.8 respectively.

### Benchmark on Processing Units with Standard Floating-Point Computation

To benchmark the computation on hardware PUs with standard floating-point, we implement the system architecture shown in 3.9. In this case, the embedded software builds the SbS network as a sequential model mapping the network computation to the hardware processing units at 200 MHz as clock frequency.

The layers of the neural network with the most neurons are partitioned for asynchronous



Figure 3.8.: Computation on embedded CPU.



Figure 3.9.: System overview of the top-level architecture with 8 processing units.

### 3. Accelerating Spike-by-Spike Neural Networks

---

Table 3.2.: Performance of processing units with standard floating-point (IEEE 754) computation.

| Hardware mapping |       | Computation schedule (ms) |           |          |       |
|------------------|-------|---------------------------|-----------|----------|-------|
| Layer            | PU    | $t_s$                     | $t_{CPU}$ | $t_{PU}$ | $t_f$ |
| HX_IN            | Spike | 0                         | 0.056     | 0.370    | 0.426 |
| H1_CONV          | Conv1 | 0.058                     | 0.598     | 2.002    | 2.658 |
| H2_POOL          | Pool1 | 0.658                     | 0.126     | 1.091    | 1.875 |
|                  | Pool2 | 0.785                     | 0.125     | 1.075    | 1.985 |
| H3_CONV          | Conv2 | 0.911                     | 0.280     | 3.183    | 4.374 |
|                  | Conv3 | 1.193                     | 0.279     | 3.176    | 4.648 |
| H4_POOL          | Pool3 | 1.473                     | 0.037     | 0.481    | 1.991 |
| H5_FC            | FC    | 1.512                     | 0.101     | 1.118    | 2.731 |
| HY_OUT           | CPU   | 1.615                     | 0.004     | 0        | 1.619 |

parallel processing. Since  $H2\_POOL$  and  $H3\_CONV$  are the layers with the most neurons, the computational workload is distributed between two PUs for each one of these layers. The output layer  $HY\_OUT$  is fully processed by the CPU, since it is the layer with fewest neurons. The hardware mapping and the computation schedule of this deployment are displayed in 3.2 and 3.10.

In the computation schedule, the following terms are defined as follows:  $t_s(n)$  as the start time for the processing of the neural network layer (as a compute node)  $n \in L$  where  $L$  represents the set of layers;  $t_{CPU}(n)$  as the CPU preprocessing time;  $t_{PU}(n)$  as the PU latency; and  $t_f(n)$  as the finish time. For data preparation, the  $t_{CPU}(n)$  is the duration in which the CPU writes a DRAM buffer with  $\vec{h}_\mu$  (vector of neuron latent variables) of the current processing layer and  $S_t$  (input spike matrix) from its preceding layer. This buffer is streamed to the PU via DMA.

The total execution time of the CPU is defined by 3.8. In a cyclic spiking inference, the execution time of the network computation is the longest path among the processing units including the CPU. This is denoted as the latency of an spike cycle and it is defined by 3.10. The total execution time of the network computation is the last finish time ( $t_f$ ) in the schedule defined by 3.11.



Figure 3.10.: Performance of processing units with standard floating-point (IEEE 754) computation.

$$T_{CPU} = \sum_{n \in L} t_{CPU}(n) \quad (3.8)$$

$$T_{PU} = \max_{n \in L} (t_{PU}(n)) \quad (3.9)$$

$$T_{SC} = \begin{cases} T_{PU}, & \text{if } T_{CPU} \leq T_{PU} \\ T_{CPU}, & \text{otherwise} \end{cases} \quad (3.10)$$

$$T_f = \max_{n \in L} (t_f(n)) \quad (3.11)$$

Using standard floating-point requires a high computational cost. As the largest layer, the computational workload of *H3\_CONV* is evenly partitioned among two PUs: *Conv2* and *Conv3*. However, in the cyclic schedule, *Conv2* causes the performance bottleneck as shown in 3.11. In this case, the CPU has to await for *Conv2* to finish the computation of the previous cycle in order to start the current computation cycle. In contrast, as the smallest layer, the computational workload of *HY\_OUT* is fully processed by the CPU. 3.2 and 3.10 show 4 $\mu$ s as the processing



Figure 3.11.: Performance bottleneck of cyclic computation on processing units with standard floating-point (IEEE 754) arithmetic, (a) exhibits the starting of  $t_{PU}$  of *Conv2* on a previous computation cycle, (b) presents  $t_{CPU}$  of *Conv2* on the current computation cycle, (c) shows the CPU waiting time (in gray color) for *Conv2* as a busy resource (awaiting for *Conv2* interruption), and (d) illustrates the  $t_f$  from the previous computation cycle, the starting of  $t_{PU}$  on the current computation cycle (*Conv2* interruption on completion, and start current computation cycle).

### 3. Accelerating Spike-by-Spike Neural Networks

---

latency of  $HY\_OUT$ . This latency is negligible compared to the overall performance assessment. Accelerating  $HY\_OUT$  would yield a negligible gain. Moreover, assigning a dedicated hardware PU to  $HY\_OUT$  would add data transfer and hardware interruption handling overheads, which makes this unprofitable.

Applying 3.10, we obtain a latency of  $3.18ms$  per spike cycle. This deployment achieves an accuracy of 98.98% correct classification on the 10,000 image test set with 1000 spikes.

The post-implementation resource utilization and power dissipation are shown in 3.3. Each *Conv* PU instantiates an on-chip stationary weight matrix of 52,000 entries, which is sufficient to store  $W \in \mathbb{R}^{5 \times 5 \times 2 \times 32}$  and  $W \in \mathbb{R}^{5 \times 5 \times 32 \times 64}$  for *H1\_CONV* and *H3\_CONV*, respectively. In order to reduce BRAM utilization, we use a custom floating-point representation composed of 4-bit exponent and 4-bit mantissa. Each 8-bit entry is promoted to its standard floating-point representation for the dot-product computation. The methodology to find the appropriate bit-width parameters for custom floating-point representation is presented in Section 3.4.2.

Table 3.3.: Resource utilization and power dissipation of processing units with standard floating-point (IEEE 754) computation.

| PU    | LUT   | FF    | DSP | BRAM 18K | Power (mW) |
|-------|-------|-------|-----|----------|------------|
| Spike | 2,640 | 4,903 | 2   | 2        | 38         |
| Conv  | 2,765 | 4,366 | 19  | 37       | 89         |
| Pool  | 2,273 | 3,762 | 5   | 3        | 59         |
| FC    | 2,649 | 4,189 | 8   | 9        | 66         |

The implementation of dot-product with standard floating-point arithmetic (IEEE 754) utilizes proprietary multiplier and adder floating-point operator cores. Vivado HLS accomplishes floating-point arithmetic operations by mapping them onto Xilinx LogiCORE IP cores, these floating-point operator cores are instantiated in the resultant RTL[91]. In this case, the implementation of the dot-product with the standard floating-point computation reuses the multiplier and adder cores already instantiated and used in other computation sections of *Conv* and *FC* processing units. The post-implementation resource utilization and power dissipation of the floating-point operator cores are shown in 4.3.

Table 3.4.: Resource utilization and power dissipation of multiplier and adder floating-point (IEEE 754) operator cores.

| Core operation | DSP | FF  | LUT | Latency (clk) | Power (mW) |
|----------------|-----|-----|-----|---------------|------------|
| Multiplier     | 3   | 151 | 325 | 4             | 7          |
| Adder          | 2   | 324 | 424 | 8             | 6          |

### Benchmark on Noise Tolerance Plot

The noise tolerance plot serves as an intuitive visual model used to provide insights into accuracy degradation under approximate processing effects. This plot reveals inherent error resilience, and hence, approximation resilience. As an application-specific quality metric, this plot offers an effective method to estimate the overall quality degradation of the SbS network under different approximate processing effects, since both approximations and noise have qualitatively the same effect[72].

In order to experimentally obtain the noise tolerance plot, we measure the inference accuracy of the neural network with increasing number of spikes. Then we repeat the measurements with uniformly distributed noise applied on the input images. We gradually ascend the levels of the noise amplitude, until accuracy degradation is detected. 3.12 demonstrates this method using 100 sample images.

As benchmark, the tolerance plot in 3.12 revels accuracy degradation having 50% noise and convergence with 400 spikes. In this case, the particular SbS network with precise processing demonstrates a remarkable inherent error resilience, hence, a great opportunity for approximate processing.

### 3.4.2. Design Exploration with Hybrid Custom Floating-Point and Logarithmic Approximation

In this section, we address a design exploration to evaluate our approach for SbS neural network simulation using hybrid custom floating-point and logarithmic approximation. First, we examine the synaptic weight matrix of each SbS network layer in order to determine the minimum requirements for numeric representation and memory storage. Second, we implement the proposed dot-product architecture using the minimal floating-point and logarithmic representation as design parameters. Finally, we evaluate the overall performance, the inference accuracy, the resource utilization, and the power dissipation.

#### Parameters for Numeric Representation of Synaptic Weight Matrix

We obtain information for the numerical representation of the synaptic weight matrices from their  $\log_2$ -histograms presented in 3.13. These histograms show the distribution of synaptic weight values in each matrix. We observe that the minimum integer exponent value is  $-13$ . Hence, applying 3.2 and 3.3 to the given SbS network, we obtain  $E_{\min} = -13$  and  $N_E = 4$ , respectively. Therefore, 4-bits are required for the absolute binary representation of the exponents.

### 3. Accelerating Spike-by-Spike Neural Networks

---

For quality configurability, the mantissa bit-width is a knob parameter that is tuned by the designer. This procedure leverages the builtin error-tolerance of neural networks and performs a trade-off between resource utilization and QoR. In the following subsection, we present a case study with 1-bit mantissa corresponding to the custom floating-point approximation.

#### Design Exploration for Dot-product with Hybrid Custom Floating-Point Approximation

For this design exploration, we use a custom floating-point representation composed of 4-bit exponent and 1-bit mantissa. This format is used for the synaptic weight vector on the proposed dot-product architecture. Each *Conv* PU instantiates an on-chip stationary weight matrix for



Figure 3.12.: Noise tolerance on hardware PU with standard floating-point (IEEE 754) computation (benchmark/reference), (a) exhibits accuracy degradation applying 50% of noise amplitude, and (b) illustrates convergence of inference with 400 spikes.

52,000 entries of 5-bit. The available memory size is large enough to store  $W \in \mathbb{R}^{5 \times 5 \times 2 \times 32}$  and  $W \in \mathbb{R}^{5 \times 5 \times 32 \times 64}$  for  $H1\_CONV$  and  $H3\_CONV$ , respectively. The same dot-product architecture is implemented in the processing unit of the fully connected layer ( $FC$ ). However, due to lack of BRAM resources, this PU can not instantiate on-chip stationary synaptic weight matrix. Instead,  $FC$  receives the  $\vec{W}(s_t)$  (weight vectors) during operation as well as  $\vec{h}_\mu$  and  $S_t$ . The hardware mapping and the computation schedule of this implementation are displayed in 3.6 and 3.14.

As shown in the computation schedule in 3.6 and 3.14, this implementation achieves a maximum hardware PU latency of  $1.30ms$  according to 3.9, and a CPU latency of  $1.67ms$ . Therefore, applying 3.10, we obtain a latency of  $1.67ms$  per spike cycle as shown in 3.14. In this case, the cyclic bottleneck is in the performance of the CPU.

This configuration achieves an accuracy of 98.97% correct classification on the 10,000 image test set with 1000 spikes. This indicates an accuracy degradation of 0.33%. For output quality monitoring, the noise tolerance plot in 3.15 revels accuracy degradation for noise higher than 50% on the input images, and convergence of inference with 400 spikes. Thus, the particular



Figure 3.13.:  $\log_2$ -histogram of each synaptic weight matrix showing the percentage of matrix elements with given integer exponent.

### 3. Accelerating Spike-by-Spike Neural Networks

---

SbS network implementation under approximate processing effects demonstrates a minimal impact on the overall accuracy. This proves an inherent error resilience, and hence, remaining approximation budget.

The post-implementation resource utilization and power dissipation are shown in 3.5 .

Table 3.5.: Resource utilization and power dissipation of processing units with hybrid custom floating-point approximation.

| PU   | LUT   | FF    | DSP | BRAM 18K | Power (mW) |
|------|-------|-------|-----|----------|------------|
| Conv | 3,139 | 4,850 | 19  | 25       | 82         |
| FC   | 3,265 | 5,188 | 8   | 9        | 66         |

### Design Exploration for Dot-Product whit Hybrid Logarithmic Approximation

As the most efficient setup and yet the worst-case quality configuration, we use a 4-bit integer exponent for logarithmic representation of the synaptic weight matrix. Each *Conv* processing unit implements the proposed dot-product architecture including an on-chip stationary weight matrix for 52,000 entries of 4-bit integer each one to store  $W \in \mathbb{N}^{5 \times 5 \times 2 \times 32}$  and  $W \in \mathbb{N}^{5 \times 5 \times 32 \times 64}$  for *H1\_CONV* and *H3\_CONV*, respectively. The same dot-product architecture is implemented



Figure 3.14.: Performance on processing units with hybrid custom floating-point approximation, (a) exhibits computation schedule, (b) presents cyclic computation schedule, and (c) shows the performance of *Conv2* from a previous computation cycle during the preprocessing of *H1\_CONV* on the current computation cycle without bottleneck.

Table 3.6.: Performance of hardware processing units with hybrid custom floating-point approximation.

| Hardware mapping |       | Computation schedule (ms) |           |          |       |
|------------------|-------|---------------------------|-----------|----------|-------|
| Layer            | PU    | $t_s$                     | $t_{CPU}$ | $t_{PU}$ | $t_f$ |
| HX_IN            | Spike | 0                         | 0.055     | 0.307    | 0.362 |
| H1_CONV          | Conv1 | 0.057                     | 0.654     | 1.309    | 2.020 |
| H2_POOL          | Pool1 | 0.713                     | 0.131     | 1.098    | 1.942 |
|                  | Pool2 | 0.845                     | 0.125     | 1.098    | 2.068 |
| H3_CONV          | Conv2 | 0.972                     | 0.285     | 1.199    | 2.456 |
|                  | Conv3 | 1.258                     | 0.279     | 1.184    | 2.721 |
| H4_POOL          | Pool3 | 1.538                     | 0.037     | 0.484    | 2.059 |
| H5_FC            | FC    | 1.577                     | 0.091     | 0.438    | 2.106 |
| HY_OUT           | CPU   | 1.669                     | 0.004     | 0        | 1.673 |

Table 3.7.: Performance of hardware processing units with hybrid logarithmic approximation.

| Hardware mapping |       | Computation schedule (ms) |           |          |       |
|------------------|-------|---------------------------|-----------|----------|-------|
| Layer            | PU    | $t_s$                     | $t_{CPU}$ | $t_{PU}$ | $t_f$ |
| HX_IN            | Spike | 0                         | 0.055     | 0.264    | 0.319 |
| H1_CONV          | Conv1 | 0.057                     | 0.655     | 1.271    | 1.983 |
| H2_POOL          | Pool1 | 0.714                     | 0.130     | 1.074    | 1.918 |
|                  | Pool2 | 0.845                     | 0.126     | 1.106    | 2.077 |
| H3_CONV          | Conv2 | 0.973                     | 0.285     | 1.179    | 2.437 |
|                  | Conv3 | 1.258                     | 0.278     | 1.176    | 2.712 |
| H4_POOL          | Pool3 | 1.538                     | 0.037     | 0.488    | 2.063 |
| H5_FC            | FC    | 1.577                     | 0.091     | 0.388    | 2.056 |
| HY_OUT           | CPU   | 1.669                     | 0.004     | 0        | 1.673 |

in the *FC* processing unit without stationary synaptic weight matrix. The hardware mapping and the computation schedule of this implementation are displayed in 3.7 and 3.16.

As shown in the computation schedule in 3.7 and 3.16, this implementation achieves a maximum hardware PU latency of  $1.27\text{ms}$  according to 3.9, and a CPU latency of  $1.67\text{ms}$ . Therefore, applying 3.10, we obtain a latency of  $1.67\text{ms}$  per spike cycle as shown in 3.16. In this case, the cyclic bottleneck is in the CPU performance.

This quality configuration achieves an accuracy of 98.84% correct classification on the 10,000 image test set with 1000 spikes. This indicates an accuracy degradation of 0.46%. For output quality monitoring, the noise tolerance plot in 3.17 revels accuracy degradation having 40% noise on the input images, and convergence of inference with 600 spikes. The particular SbS network implementation under approximate processing demonstrates a minor impact on the overall accuracy. This exhibits remaining budget for further approximate processing approaches.

The post-implementation resource utilization and power dissipation are shown in 3.8 .

### 3. Accelerating Spike-by-Spike Neural Networks

---

Table 3.8.: Resource utilization and power dissipation of processing units with hybrid logarithmic approximation.

| PU   | LUT   | FF    | DSP | BRAM 18K | Power (mW) |
|------|-------|-------|-----|----------|------------|
| Conv | 3,086 | 4,804 | 19  | 21       | 78         |
| FC   | 3,046 | 4,873 | 8   | 8        | 66         |

#### 3.4.3. Results and Discussion

As a reference, the SbS network simulation on embedded CPU using standard 32-bit floating-point achieves an accuracy of 99.3% with a latency of  $T_{SC} = 34.28ms$ . As a second reference



Figure 3.15.: Noise tolerance on hardware PU with custom floating-point approximation, (a) exhibits accuracy degradation applying 50% of noise amplitude, and (b) illustrates convergence of inference with 400 spikes.

point, the network simulation on hardware processing units with standard floating-point achieves an accuracy of 98.98% with a latency  $T_{SC} = 3.18ms$ . As result we get a  $10.7\times$  latency enhancement and an accuracy degradation of 0.32%. The tolerance plot in 3.12 reveals accuracy degradation having 50% noise on the input images, and convergence of inference with 400 spikes. In this case, the SbS network deployment with precise computing proves extraordinary inherent error resilience, and hence, this represents a great potential for approximate processing.

As a demonstration of the proposed dot-product architecture, the SbS network simulation on hardware PUs with synaptic representation using 5-bit custom floating-point (4-bit exponent, 1-bit mantissa) and 4-bit logarithmic (4-bit exponent) achieve  $20.5\times$  latency enhancement and accuracy of 98.97% and 98.84%, respectively. This results in an accuracy degradation of 0.33% and 0.46%, respectively. For output quality monitoring, the noise tolerance plot in 3.15 and 3.17 reveal accuracy degradation when having 50% and 40% noise on the input images, and convergence of inference with 400 and 600 spikes, respectively. Therefore, the design exploration under the proposed approximate computing approach indicates sufficient inherent error resilience for further and more aggressive approximate processing approaches.

Regarding resource utilization and power dissipation with the proposed approach, the *Conv* processing units have a 43.24% reduction of BRAM, and a 12.35% of improvement in energy efficiency over the standard floating-point implementation. However, the proposed approach does not reuse the available floating-point operator cores instantiated from other computational



Figure 3.16.: Performance of processing units with hybrid logarithmic approximation, (a) exhibits computation schedule, and (b) illustrates cyclic computation schedule.

sections (see 4.3). Therefore, the logic required for the dot-product must be implemented, which is reflected as additional utilization of LUT and FF resources. The experimental results of the design exploration are summarized in 3.9. The platform implementations are summarized in 3.10, and their power dissipation breakdowns are presented in 3.18.

### 3.5. Conclusions

In this work, we accelerate SbS neural networks with a dot-product functional unit based on approximate computing that combines the advantages of custom floating-point and logarithmic



Figure 3.17.: Noise tolerance on hardware PU with hybrid logarithmic approximation, (a) exhibits accuracy degradation applying 40% of noise amplitude, (b) illustrates convergence of inference with 600 spikes.

Table 3.9.: Experimental results.

| Dot-product implementation                       | PU   | Post-implementation resource utilization |       |     |          | Power (mW) | Latency       |                   | Accuracy |
|--------------------------------------------------|------|------------------------------------------|-------|-----|----------|------------|---------------|-------------------|----------|
|                                                  |      | LUT                                      | FF    | DSP | BRAM 18K |            | $T_{SC}$ (ms) | Gain <sup>d</sup> |          |
| Standard floating-point computation <sup>a</sup> | Conv | 2,765                                    | 4,366 | 19  | 37       | 89         | 3.18          | 10.7x             | 98.98%   |
|                                                  | FC   | 2,649                                    | 4,189 | 8   | 9        | 66         |               |                   |          |
| Hybrid custom floating-point approx <sup>b</sup> | Conv | 3,139                                    | 4,850 | 19  | 25       | 82         | 1.67          | 20.5x             | 98.97%   |
|                                                  | FC   | 3,265                                    | 5,188 | 8   | 9        | 66         |               |                   |          |
| Hybrid logarithmic approximation <sup>c</sup>    | Conv | 3,086                                    | 4,804 | 19  | 21       | 78         | 1.67          | 20.5x             | 98.84%   |
|                                                  | FC   | 3,046                                    | 4,873 | 8   | 8        | 66         |               |                   |          |

<sup>a</sup> Reference with standard floating-point arithmetic (IEEE 754).

<sup>b</sup> Synaptic weight with number representation composed of 4-bit exponent and 1-bit mantissa.

<sup>c</sup> Synaptic weight with number representation composed of 4-bit exponent.

<sup>d</sup> Acceleration with respect to the computation on embedded CPU (ARM Cortex-A9 at 666 MHz) with latency  $T_{SC} = 34.28\text{ms}$ .

<sup>e</sup> Accuracy on 10,000 image test set with 1000 spikes.

Table 3.10.: Platform implementations.

| Platform implementation                                      | Post-implementation resource utilization |        |     |          | Power (W) | Clock (MHz) | Latency       |                   | Accuracy |
|--------------------------------------------------------------|------------------------------------------|--------|-----|----------|-----------|-------------|---------------|-------------------|----------|
|                                                              | LUT                                      | FF     | DSP | BRAM 18K |           |             | $T_{SC}$ (ms) | Gain <sup>f</sup> |          |
| ?? <sup>a</sup>                                              | 42,740                                   | 57,118 | 49  | 92       | 2.519     | 250         | 4.65          | 7.4x              |          |
| This work (standard floating-point computation) <sup>b</sup> | 39,514                                   | 56,036 | 82  | 180      | 2.420     | 200         | 3.18          | 10.7x             |          |
| This work (hybrid custom floating-point approx) <sup>c</sup> | 42,021                                   | 58,759 | 82  | 156      | 2.369     | 200         | 1.67          | 20.5x             |          |
| This work (hybrid logarithmic approximation) <sup>d</sup>    | 41,060                                   | 57,862 | 82  | 148      | 2.324     | 200         | 1.67          | 20.5x             |          |

<sup>a</sup> Reference architecture with homogeneous AUs using standard floating-point arithmetic (IEEE 754).

<sup>b</sup> Reference architecture with specialized heterogeneous PUs using standard floating-point arithmetic (IEEE 754).

<sup>c</sup> Proposed architecture with specialized heterogeneous PUs using synaptic weight with number representation composed of 4-bit exponent and 1-bit mantissa.

<sup>d</sup> Proposed architecture with specialized heterogeneous PUs using synaptic weight with number representation composed of 4-bit exponent.

<sup>e</sup> Acceleration with respect to the computation on embedded CPU (ARM Cortex-A9 at 666 MHz) with latency  $T_{SC} = 34.28\text{ms}$ .

<sup>f</sup> Accuracy on 10,000 image test set with 1000 spikes.

representations. This approach reduces computational latency, memory footprint, and power dissipation while preserving classification accuracy. For output quality monitoring, we applied noise tolerance plots as an intuitive visual measure to provide insights into the accuracy degra-



Figure 3.18.: Power dissipation breakdown of platform implementations, (a) ?? architecture with homogeneous AUs using standard floating-point arithmetic (IEEE 754), (b) reference architecture with specialized heterogeneous PUs using standard floating-point arithmetic (IEEE 754), (c) proposed architecture with hybrid custom floating-point approximation, and (d) proposed architecture with hybrid logarithmic approximation.

### *3. Accelerating Spike-by-Spike Neural Networks*

---

dation of SbS networks under different approximate processing effects. This plot revels inherent error resilience, hence, the possibilities for approximate processing.

We demonstrate our approach using a design exploration flow on a Xilinx Zynq-7020 with a deployment of SbS network for the MNIST classification task. This implementation achieves up to  $20.5\times$  latency enhancement,  $8\times$  weight memory footprint reduction, and 12.35% of energy efficiency improvement over the standard floating-point hardware implementation, and incurs in less than 0.5% of accuracy degradation. Furthermore, with a noise amplitude of 50% added on top of the input images, the SbS network presents an accuracy degradation of less than 5%. As output quality monitor, the resulting noise tolerance plots demonstrate a sufficient QoR for minimal impact on the overall accuracy of the neural network under the effects of the proposed approximation technique. These results suggest available room for further and more aggressive approximate processing approaches.

In summary, based on the relaxed need for fully accurate or deterministic computation of SbS neural networks, approximate computing techniques allow substantial enhancement in processing efficiency with moderated accuracy degradation.

---

# 4. Accelerating Convolutional Neural Networks

---

|                                          |           |
|------------------------------------------|-----------|
| <b>4.1. Introduction</b>                 | <b>49</b> |
| <b>4.2. Related Work</b>                 | <b>52</b> |
| 4.2.1. Hybrid Custom Floating-Point      | 52        |
| 4.2.2. Low-Precision Floating-Point      | 53        |
| 4.2.3. Low-Power Hardware Architectures  | 53        |
| <b>4.3. System Design</b>                | <b>53</b> |
| 4.3.1. Base Embedded System Architecture | 53        |
| 4.3.2. Tensor Processor                  | 54        |
| 4.3.3. Training Method                   | 59        |
| 4.3.4. Embedded software architecture    | 61        |
| <b>4.4. Experimental Results</b>         | <b>63</b> |
| 4.4.1. Sensor Analytics Application      | 63        |
| 4.4.2. Training                          | 66        |
| 4.4.3. Hardware Design Exploration       | 70        |
| 4.4.4. Discussion                        | 73        |
| <b>4.5. Conclusions</b>                  | <b>77</b> |

---

## 4.1. Introduction

There is a growing demand for ubiquitous AI sensor analytics. Industry 4.0 and smart city infrastructure leverage AI solutions to increase productivity and adaptability[92]. These solutions

#### *4. Accelerating Convolutional Neural Networks*

---

are powered by advances in ML, compute engines, and big data. Hence, improvements of these should be considered for research, as they are the machinery of the future.

Convolutional neural networks (CNNs) represent the essential building blocks in 2D pattern analytics. Sensor-based applications such as mechanical fault diagnosis[48, 49], structural health monitoring[50], human activity recognition (HAR) [51], hazardous gas detection[52] have been powered by CNN models in industry and academia. CNN-based models, as one of the main types of artificial neural networks (ANNs), have been widely used in sensor analytics with automatic learning from sensor data [93, 94, 95, 96]. In this context, CNN models are applied for automatic feature learning, usually, from 1D time series as well as for 2D time-frequency spectrograms. CNN models provide advantages such as local dependency, scale invariance, and noise resilience in analytics[31]. However, these models are computationally intensive and power-hungry. This is particularly challenging for low-power embedded applications in the field of Internet-of-Things (IoT).

For ML inference, dedicated hardware architectures are typically used to enhance compute performance and power efficiency. In terms of computational throughput, graphics processing units (GPUs) offer the highest performance; in terms of power efficiency, ASIC and FPGA solutions are more energy efficient [97]. As a result, numerous commercial ASIC and FPGA accelerators have been proposed, targeting both high performance computing (HPC) for data-centers and embedded systems applications.

However, most FPGA accelerators have been implemented to target mid- to high-range FPGAs for computationally intensive CNN models such as AlexNet, VGG-16, and ResNet-18. The main drawbacks of these implementations are power supply demands, physical dimensions, heat sink requirements, air cooling, and a resulting high price. In some cases, these implementations are not feasible for ubiquitous low-power/resource-constrained applications.

To reduce the compute hardware for CNN inference there are two types of research [98]: the first one is deep compression including weight pruning, weight quantization, and compression storage [20, 99]; the second type of research corresponds to a more efficient data representation, also known as custom quantization for dedicated hardware implementation. In this group, hardware implementations with customized 8-bit floating-point computation have been proposed [100, 98, 101]. However, these architectures are inadequate for embedded applications, the target devices are high-end FPGA and PCIe devices.

Reducing the compute hardware with more aggressive quantization such as binary [19], ternary [53], and mixed precision (2-bit activations and ternary weights) [54] typically incur significant accuracy degradation for very low precisions, especially for complex problems[55].

In this paper, we present the Hybrid-Float6 quantization and its dedicated hardware design.

In this concept, feature maps are represented by standard 32-bit FP and trainable parameters by 6-bit FP. To preserve accuracy, we introduce a quantization-aware training (QAT) method. To preserve model accuracy, we present a quantization-aware training method, which in some cases improves model performance. For ML compatibility/portability, the 6-bit FP is wrapped into the standard FP representation. The dedicated hardware design extracts the 6-bit format automatically and performs the computation. We propose a parameterized tensor processor implementing a pipelined vector dot-product with HF6. The 6-bit FP representation uses 4-bit exponent and 1-bit mantissa. This approach enables an optimized MAC design by reducing the mantissa multiplication to a mux-adder operation. We leverage the intrinsic error tolerance of ANN to further reduce the hardware design with approximation. This approach reduces latency, resource utilization, and power dissipation. The embedded hardware/software architecture is integrated with TensorFlow Lite using delegate interface to accelerate *Conv2D* tensor operations. We evaluate the applicability of our approach with a CNN-regression model and hardware design exploration for sensor analytics of SHM for anomaly localization. The embedded hardware/software framework is demonstrated on XC7Z007S as the smallest and most inexpensive Zynq SoC device, see **Fig. 4.1**. To the best of our knowledge, this is the first research addressing 6-bit floating-point quantization on CNN models and its dedicated hardware design.

Our main contributions are as follows:

1. We present the Hybrid-Float6 quantization and its dedicated hardware design. We propose an optimized hardware MAC by reducing the mantissa multiplication to a mux-adder operation. We exploit the intrinsic error tolerance of ANN to further reduce the hardware design with approximation. To preserve model accuracy, we present a quantization-aware training method, which in some cases improves accuracy.
2. We develop a custom hardware/software co-design framework for sensor analytics applications on low-power embedded FPGAs. This architecture integrates TensorFlow Lite.
3. We present a customizable tensor processor as a dedicated hardware for HF6. This design computes *Conv2D* tensor operations employing a pipelined vector dot-product



Figure 4.1.: The workflow of our approach on embedded FPGAs.

with parametrized on-chip memory utilization. The compute engine can be implemented with standard floating-point (with FP Xilinx LogiCORE IPs) and HF6.

4. We demonstrate the potential of our approach with a CNN-regression model for anomaly localization in SHM based on AE. We address a hardware design exploration. We evaluate inference accuracy, compute performance, hardware resource utilization, and energy consumption.

The rest of the paper is organized as follows. Section 4.2 covers the related work; Section ?? introduces the background for *Conv2D* tensor operation and floating-point number representation; Section 4.3 describes the system design of the hardware/software architecture and the quantized aware training method; Section 4.4 presents the experimental results thorough a design exploration flow; Section 4.5 concludes the paper.

This work is available to the community as an open-source project at <https://github.com/YaribNevarez/tensorflow-lite-fpga-delegate.git>.

## 4.2. Related Work

In the literature we find plenty of hardware architectures for CNN accelerators implemented in FPGA. Most of the research implements fixed-point quantization, and very limited research focuses on FP. Moreover, to the best of our knowledge, there is no research work exploring FP inference for low-power embedded applications.

### 4.2.1. Hybrid Custom Floating-Point

In [102], Liangzhen Lai et al. proposed a mixed data representation with floating-point for weights and fixed-point for activations. [102] demonstrated on SqueezeNet, AlexNet, GoogLeNet, and VGG-16 that reduced FP quantization (4-bit exponent and 3-bit mantissa) results in constant negligible accuracy degradation. In [103], Sean O. Settle et al. presented an 8-bit FP quantization scheme, which needs an extra inference batch to compensate for quantization error. However, [102] and [103] did not present a hardware architecture. In [101], Xiaocong Lian et al. proposed an accelerator with optimized block floating-point (BFP), in this design the activations and weights are represented by 16-bit and 8-bit FP formats, respectively. This design is demonstrated on Xilinx VC709 evaluation board. This implementation achieves throughput and power efficiency of 760.83 GOP/s and 82.88 GOP/s/W, respectively.

### 4.2.2. Low-Precision Floating-Point

In [100], Chunsheng Mei et al. presented a hardware accelerator for VGG16 model using half-precision FP (16-bit). This design is demonstrated on Xilinx Virtex-7 (XC7VX690T) with PCIe interface. This implementation achieves throughput and power efficiency of 202.8 GFLOP/s and 18.72 GFLOP/s/W, respectively. In [98], Chen Wu et al. proposed a low-precision (8-bit) floating-point (LPFP) quantization method for FPGA-based acceleration. This design is demonstrated on Xilinx Kintex 7 and Ultrascale/Ultrascale+. This implementation achieves throughput and power efficiency of 1086.8 GOP/s and 115.4 GOP/s/W, respectively.

### 4.2.3. Low-Power Hardware Architectures

Two research papers have been reported hardware accelerators targeting XC7Z007S. This is the smallest and most inexpensive device from Zynq-7000 SoC family. In [104], Paolo Meloni et al. presented a CNN inference accelerator for compact and cost-optimized devices. This implementation uses fixed-point for processing light-weight CNN architectures with a power efficiency between 2.49 to 2.98 GOPS/s/W. In [105], Chang Gao et al. presented EdgeDRNN, a recurrent neural network (RNN) accelerator for edge inference. This implementation adopts the spiking neural network (SNN) inspired delta network algorithm to exploit temporal sparsity in RNNs.

## 4.3. System Design

The system design is a hardware/software co-design framework for low-power AI deployment. This architecture allows design exploration of dedicated hardware integrated with TensorFlow Lite on low-cost embedded FPGAs.

### 4.3.1. Base Embedded System Architecture

The base embedded system architecture implements a cooperative hardware-software platform. See **Fig. 4.2**. The embedded CPU delegates low-level compute-bound tensor operations to the TPs. The TPs employ AXI-Lite interface for configuration and AXI-Stream interfaces via Direct Memory Access (DMA) for data movement from DDR memory. Each TP asserts an interrupt flag once the job/transaction is complete. Interrupt events are handled by the embedded CPU to collect results and start a new transaction. The hardware architecture can vary its resource utilization by customizing the TPs prior to the hardware synthesis.

### 4.3.2. Tensor Processor

The TP is a dedicated hardware module to compute tensor operations. This architecture implements high performance communication with AXI-Stream, direct CPU communication with AXI-Lite, and on-chip storage utilizing BRAM. This hardware architecture is implemented with high-level synthesis (HLS). The tensor operations are implemented based on the C++ TensorFlow Lite micro kernels. See **Fig. 4.3.**

The TP is an extensible hardware module that executes low-level tensor operations. In this paper, we focus on the *Conv2D* tensor operation that computes 2D convolution layers.

#### Modes of Operation

The TP has two modes of operation: *configuration* and *execution*.



Figure 4.2.: Base embedded system architecture.



Figure 4.3.: Hardware architecture of the proposed tensor processor.

- In *configuration* mode, the TP receives the tensor operation hyperparameters: stride, dilation, padding, offset, activation, depth-multiplier, input shape, filter shape, bias shape, and output shape. Afterwards, the TP receives filter and bias tensors, which are locally stored in BRAM. The filter and bias tensors are transferred with standard FP representation. Then, the TP extracts the 6-bit FP format for local on-chip storage.
- In *execution* mode, the TP executes the tensor operation according to the hyperparameters given in the configuration mode. During execution, the input and output tensors are moved from/to the off-chip memory via DMA.

### Dot-Product with Hybrid Floating-Point Computation

We implement the floating-point computation adopting the dot-product with hybrid custom floating-point[56]. The hardware dot-product is illustrated in **Fig. 4.4** and **Fig. 4.5(a)**. This design instantiates an HF6 MAC and an accumulator variable of 64-bit fixed-point with 23-bit fraction. During operation, the feature map and filter values are extracted from on-chip memory (BRAM). Both values have to be different than zero to enable the MAC operation. The result is biased by accumulating a denormalized bias value. Since the bias is stored with 6-bit FP, its fractional part has to be aligned with the 23-bit fraction of the accumulator, see **Fig. 4.5(b)**. The ReLu activation is applied and the result is normalized to converted to IEEE 754 standard FP, see **Fig. 4.5(c)**.

Rather than a parallelized structure, this is a pipelined hardware design suitable for resource-limited devices. The latency in clock cycles of this hardware module is defined by **Eq. (4.1)**, where  $N$  is the vector length. This latency equation is obtained from the general pipelined hardware latency formula:  $L = (N - 1) II + IL$ , where  $II$  is the initiation interval, and  $IL$  is the iteration latency. Both  $II$  and  $IL$  are obtained from the high-level synthesis results. Both the exponent and mantissa bit widths of the filter and bias buffers are set to a 4-bit exponent and a 1-bit mantissa (E4M1), which corresponds to float6 quantization.

$$L_{hf} = N + 7 \quad (4.1)$$

### Multiply-Accumulate

The multiply-accumulate operation calculates the product of two numbers and adds the result to an accumulator. In FP arithmetics, the size of a hardware multiplier scales with the size of the

#### 4. Accelerating Convolutional Neural Networks

mantissas. In the case of HF6, the 6-bit FP representation allows optimization in the mantissa multiplication. The 1-bit mantissa enables efficient MAC implementations by reducing the



Figure 4.4.: Dot-product hardware module with (a) standard floating-point and (b) Hybrid-Float6.



Figure 4.5.: (a) Dot-product hardware module with Hybrid-Float6 MAC, (b) bias accumulation, (c) activation and normalization.

mantissa multiplication to a multiplexed addition, see **Fig. 4.6**. The denormalized results are accumulated in a fixed-point accumulator. This approach reduces latency, energy consumption, and resource utilization.

The Infinity and NaN special cases are not considered in this design since are not expected in ANN computation. For the subnormal case, the element-wise multiplication is disabled when having a zero entry and is approximated when having subnormal mantissa. The feature map values are considered zero when the exponent is zero ( $E_F = 0$ ). The filter values are considered zero when both exponent and mantissa are zero ( $E_H = 0 \wedge M_H = 0$ ). See **Fig. 4.5(a)**. In the 6-bit FP, the 1-bit mantissa has one subnormal case, which is handled as a normalized case. This exploits the intrinsic error tolerance of ANN to reduce the hardware design.

The approximation error is defined by the difference between **Eq. (2.3)** and **Eq. (2.5)** when  $E = 0$  and  $M = 2^{-1}$ . The result defines the error as  $e = 2^{-B-1}$ . Then, from **Eq. (2.4)** with  $E_{size} = 4$ , we have  $B = 7$ . Hence,  $e = 3.9e-3$ . This error is produced when having



Figure 4.6.: Hybrid-Float6 multiply-accumulate hardware design.

the subnormal case  $E = 0$  and  $M = 2^{-1}$ , which corresponds to the value  $\pm 7.8e-3$  deviated to  $\pm 1.17e-2$ . This approximation leverages the intrinsic error tolerance of ANN to reduce hardware resource utilization and energy consumption [31].

### On-Chip Memory Utilization

The total on-chip memory utilization on the TP is defined by **Eq.** (4.2), where  $TP_B$  and  $V_M$  represent the tensor buffers and local variables (memory) required for the design, respectively. **Eq.** (4.3) defines the tensor buffers, where  $Input_M$  is the *input buffer*,  $Filter_M$  is the *filter buffer*,  $Bias_M$  is the *bias buffer*. The on-chip memory buffers are defined in bits. **Fig.** 4.7 illustrates the convolution operation utilizing the on-chip memory buffers.

$$TP_M = TP_B + V_M \quad (4.2)$$

$$TP_B = Input_M + Filter_M + Bias_M \quad (4.3)$$

The memory utilization of *input buffer* is defined by **Eq.** (4.4), where  $K_H$  is the height of the convolution kernel,  $W_I$  is the width of the input tensor,  $C_I$  is the number of input channels, and  $BitSize_I$  is the bit size of input tensor.

$$Input_M = K_H W_I C_I BitSize_I \quad (4.4)$$

The memory utilization of *filter buffer* is defined by **Eq.** (4.5), where  $K_W$  and  $K_H$  are the width and height of the convolution kernel, respectively;  $C_I$  and  $C_O$  are the number of input and



Figure 4.7.: Design parameters for on-chip memory buffers on the TP.

output channels, respectively; and  $BitSize_F$  is the bit size of filter values.

$$Filter_M = C_I K_W K_H C_O BitSize_F \quad (4.5)$$

The memory utilization of *bias buffer* is defined by Eq. (4.6), where  $C_O$  is the number of output channels, and  $BitSize_B$  is the bit size of bias values.

$$Bias_M = C_O BitSize_B \quad (4.6)$$

As a design trade-off, Eq. (4.7) defines the capacity of output channels based on the given design parameters. The total on-chip memory  $TP_M$  determines the TP capacity.

$$C_O = \frac{TP_M - V_M - K_H W_I C_I BitSize_I}{C_I K_W K_H BitSize_F + BitSize_B} \quad (4.7)$$

The floating-point formats implemented in the TP are defined by  $BitSize_F$ ,  $BitSize_B$  and  $BitSize_I$ . The HF6 defines 6-bit for  $BitSize_F$  and  $BitSize_B$ , and 32-bit for  $BitSize_I$ . These are design parameters defined before hardware synthesis. This allows fine control of BRAM utilization, which is suitable for resource-limited devices.

### 4.3.3. Training Method

The models are trained and quantized in separate stages.

#### Training with Iterative Early Stop

To achieve better performance on CNN-regression models, we implement a training procedure with iterative early stop cycle. This allows to reach better local minima. This is a four steps process:

1. A model is obtained with an initial training with standard early stop monitoring.
2. The model is iteratively re-trained with standard early stop to search for better local minima. In each early stop the Adam optimizer restarts the moving averages.
3. In case of a better local minimum, the base model is updated/saved and used for subsequent re-training iterations, otherwise it is discarded.
4. The cyclic process stops automatically with a given patience. This allows to set a maximum training iterations before the stop.

This method is described in **Algorithm 2**.

### Quantization Aware Training

The quantization aware training (QAT) method is integrated into the training process, this operates after each mini-batch update. The quantization is applied on the trainable parameters of convolution layers. This method is implemented as a callback function in the TensorFlow/Keras framework, see **Algorithm 3**.

The quantization method uses rounding strategy to reduce the FP representation. This maps the full precision FP values to the closest representable 6-bit FP values, see **Algorithm 4**. This method quantizes the filter and bias tensors of the convolution layers. We have observed that the exponent bit size plays a more predominant influence on the model accuracy than the mantissa bit size. In [102], Lai et al. demonstrated that 4-bit exponent is adequate and consistent across different networks (SqueezeNet, AlexNet, GoogLeNet, VGG-16). In this work, we investigate 4-bit exponent and 1-bit mantissa.

---

**Algorithm 2:** Training with iterative early stop cycle.

---

```

input: MODEL as the input model.
input:  $D_{train}$  as the training data set.
input:  $D_{val}$  as the validation data set.
input:  $N_I$  as the stop patience for iterative training cycle.
input:  $N_E$  as the early stop patience (epochs) for training.
input:  $B_{size}$  as the mini-batch size.
output: MODEL as the full-precision output model.

 $Train(MODEL, D_{train}, D_{val}, N_E, B_{size})$ 
 $mse_i \leftarrow Evaluate(MODEL, D_{val})$  // Benchmark
 $n_i \leftarrow 0$ 
while  $n_i < N_I$  do
    // Iterative early stop cycle
     $Train(MODEL, D_{train}, D_{val}, N_E, B_{size})$ 
     $mse_v \leftarrow Evaluate(MODEL, D_{val})$ 
    if  $mse_v < mse_i$  then
         $Update(MODEL)$ 
         $mse_i \leftarrow mse_v$ 
    else
         $MODEL \leftarrow LoadPreviousModel()$ 
         $n_I \leftarrow n_I + 1$ 
    end if
end while

```

---

**Algorithm 3:** OnMiniBatchUpdate\_Callback.

---

```

input:  $MODEL$  as the full-precision input model.
input:  $E_{size}$  as the target exponent bits size.
input:  $M_{size}$  as the target mantissa bits size.
input:  $D_{train}$  as the training data set.
input:  $D_{val}$  as the validation data set.
input:  $N_{ep}$  as the number of epochs.
input:  $B_{size}$  as the mini-batch size.
output:  $MODEL$  as the quantized output model.

//Quantize
 $MODEL \leftarrow \text{Algorithm 4}(MODEL, E_{size}, M_{size})$ 
if  $1 < epoch$  then
    // Update model after first epoch
     $mse_v \leftarrow Evaluate(MODEL, D_{val})$ 
    if  $mse_v < mse_i$  then
         $Update(MODEL)$ 
         $mse_i \leftarrow mse_v$ 
    end if
end if

```

---

**4.3.4. Embedded software architecture**

The software architecture is a layered object-oriented application framework written in C++, see **Fig. 4.8.** A description of the software layers is as follows:

- *Application:* As the highest level of abstraction, this software layer implements the application invoking the ML library.
- *Machine learning library:* This layer consist of TensorFlow Lite for micro controllers. This offers a comprehensive high level API that allows ML inference. This provides delegate software interfaces for custom hardware accelerators.



Figure 4.8.: Base embedded software architecture.

---

**Algorithm 4:** Custom floating-point quantization.

---

```

input: MODEL as the CNN.
input:  $E_{size}$  as the target exponent bit size.
input:  $M_{size}$  as the target mantissa bits size.
input:  $STDM_{size}$  as the IEEE 754 mantissa bit size.
output: MODEL as the quantized CNN.

for layer in MODEL do
    if layer is Conv2D or SeparableConv2D then
        filter, bias  $\leftarrow$  GetWeights(layer)
        for x in filter and bias do
            sign  $\leftarrow$  GetSign(x)
            exp  $\leftarrow$  GetExponent(x)
            fullexp  $\leftarrow$   $2^{E_{size}-1} - 1$  // Get full range value
            cman  $\leftarrow$  GetCustomMantissa(x,  $M_{size}$ )
            leftman  $\leftarrow$  GetLeftoverMantissa(x,  $M_{size}$ )
            if exp  $< -fullexp$  then
                x  $\leftarrow$  0
            else if exp  $> fullexp$  then
                x  $\leftarrow (-1)^{sign} \cdot 2^{fullexp} \cdot (1 + (1 - 2^{-M_{size}}))$ 
            else
                if  $2^{STDM_{size}-M_{size}-1} - 1 < leftman$  then
                    cman  $\leftarrow$  cman + 1 // Above halfway
                    if  $2^{M_{size}} - 1 < cman$  then
                        cman  $\leftarrow$  0 // Correct mantissa overflow
                        exp  $\leftarrow$  exp + 1
                    end if
                end if
            end if
            // Build custom quantized floating-point value
            x  $\leftarrow (-1)^{sign} \cdot 2^{exp} \cdot (1 + cman \cdot 2^{-M_{size}})$ 
        end if
    end for
    SetWeights(layer, filter, bias)
end if
end for

```

---

- *Hardware abstraction layer:* This layer consist of the hardware drivers to handle initialization and runtime operation of the TP and DMA.

## 4.4. Experimental Results

In this section, we present experimental results on a low-power/low-cost sensor analytics application. As a use case, we present a CNN-regression model to predict x- y- coordinates of acoustic emissions based on piezoelectric vibrations. We compare quantitative and qualitative aspects of the analytics using floating-point 32-bit, fixed-point 8-bit, Hybrid-Logarithmic 6-bit, and Hybrid-Float6.

To demonstrate the proposed hardware concept, we deploy the CNN model for low-power inference in the smallest Zynq SoC. We compare the performance of the TP implemented with standard FP (using Xilinx LogiCORE IPs) and Hybrid-Float6 architecture.

### 4.4.1. Sensor Analytics Application

The analytics model is designed to predict x- y- coordinates of acoustic emissions on a metal plate. The metal plate is in the presence of noise disturbance to simulate realistic conditions. In this subsection, we present the structure for experimental setup, data sets, and the CNN-regression model.

#### Experimental Setup

The experiment uses eight piezoelectric sensors (Vallen Systeme VS900) attached with magnetic holders on a metal plate (90 cm x 86.6 cm x 0.3 cm). The VS900 devices can operate either in active or passive mode. Six VS900 are used in passive mode as acoustic sensors and two in active mode to produce acoustic emissions. These acoustic emissions simulate anomalies on x- y- coordinates as well as the noise disturbance on the system. See **Fig. 4.9(a)**. The acoustic emissions are labeled with their coordinates to create data sets.

#### Data Sets

The data sets are recorded applying pulses on the metal plate, the x- y- coordinates of these pulses are used as labels. The pulses for training and validation data sets are shown in **Fig. 4.9(b)** and **Fig. 4.9(c)**, respectively. The pulses for training and validation data sets are mutually exclusive, this exclusion is represented by the cross symbols in **Fig. 4.9(c)**. This creates a grid layout used to collect samples for the data sets. This grid is  $10 \times 10$ . This grid does not consider the four corners as they are used for magnetic holders.

In order to create reproducible acoustic emissions, we use 9-cycle sine pulse in a Hanning window with central frequency  $f_c$  (narrow-banded in the frequency domain). We assume guided

#### 4. Accelerating Convolutional Neural Networks

---

Lamb waves based on the plate structure. The narrow-band behavior also reduces the dispersion of the acoustic emission waves [106]. The waveform can be expressed as a function of time  $t$  as follows:

$$x_{\text{pulse}}(t) = \frac{1}{2} \left( 1 - \cos \frac{f_c t}{5} \right) A_0 \sin f_c t. \quad (4.8)$$

To generate the data sets, we use slightly different pulse amplitudes and frequencies for



Figure 4.9.: Experimental setup for sensor analytics on structural health monitoring, all lengths are in meters (m).

excitation. The pulse frequency  $f_c$  is varied in 1 kHz steps between 300 kHz and 349 kHz and the amplitude  $A_0$  is varied in 0.1 V steps between 2.6 V and 3.5 V. This results in 500 different pulses for each of the excitation points.

The signals for labeled pulses and noise disturbance are generated by arbitrary waveform generators (AWGs). The sensor signals are recorded via a Vallen AMSY-6 measurement system with a resolution of 18 bits and a sampling rate of  $f_s = 10\text{ MHz}$ . The disturbance signal is gaussian noise with amplitudes between 0-3 V. This noise is applied via the piezoelectric device  $N$  at  $x = 0.227\text{ m}$  and  $y = 0.828\text{ m}$ , see **Fig. 4.9(a)**.

To obtain both time and frequency domain information, the sampled pulses are converted into the time-frequency domain using the Short-Time Fourier Transform (STFT). This is calculated as follows [107]:

$$\mathcal{F}_{m,k}^{\gamma} = \sum_{n=0}^{N-1} x[n] \cdot \gamma^*[n - m\Delta M] \cdot e^{\frac{-j2\pi kn}{N}} \quad (4.9)$$

Here  $x[n]$  describes a discrete-time signal and  $\gamma^*[n - m\Delta M] \cdot e^{\frac{-j2\pi kn}{N}}$  the time- and frequency-shifted window function inside the considered interval  $[0, N-1]$ .  $\Delta M$  describes the time shift and  $N$  the transformation window. Since only discrete frequencies and time points are considered,  $m = 0, 1, \dots, M - 1$  is valid. This complex-valued STFT is converted to real numbers via the magnitude square for pictorial representation in a spectrogram  $\mathcal{S}_{m,k}$ :

$$\mathcal{S}_{m,k} = \left| \mathcal{F}_{m,k}^{\gamma} \right|^2 = \left| \sum_{n=0}^{N-1} x[n] \cdot \gamma^*[n - m\Delta M] \cdot e^{\frac{-j2\pi kn}{N}} \right|^2 \quad (4.10)$$

In addition, these spectrograms are scaled in decibels. The spectrogram in decibels  $\mathcal{S}_{m,k,\text{dB}}$  results in  $\mathcal{S}_{m,k,\text{dB}} = 20 \cdot \log_{10}(\mathcal{S}_{m,k})$ . For the conversion of the data, we use a signal length of 400  $\mu\text{s}$  (75  $\mu\text{s}$  pretrigger and 325  $\mu\text{s}$  post trigger). Thus, the arrival times of the pulses are included in the spectrogram for all channels and labeled positions. We use a Blackman window function [108], a Fast Fourier Transform (FFT) length of 32 samples, and an overlap of 8 samples. The spectrograms are calculated for frequencies in the range of 100 kHz to 500 kHz. This results in a spectrogram with 8x16 values (8 frequency values, 16 time values).

In order to generate larger data sets, we create four further variants with time shifts of 15  $\mu\text{s}$ / 30  $\mu\text{s}$ / 45  $\mu\text{s}$ / 60  $\mu\text{s}$ . Subsequently, all spectrograms are converted to grayscale with scaling between -100dB and -40dB, see **Fig. 4.10**. Overall, the data set has a size of 500 (pulses)  $\cdot$  5

(spectrograms) · 6 (listening sensors) · 96 (excitation points) = 1,440,000 images.

### CNN-Regression Model

The data analytics is implemented with a CNN-regression model, see **Fig. 4.11**. The structure of the model is described below:

- a) Input tensor. This is composed of spectrograms from the sensor signals. The tensor shape is defined by  $S \times T \times F$ , where  $S$  is the number of sensors, and  $T \times F$  is the time-frequency resolution of the spectrograms, see **Fig. 4.11(a)**.
- b) Feature extraction. This is composed of three blocks of convolution, batch normalization, and max-pooling layers, see **Fig. 4.11(b)**. The number of channels in the convolution layers are defined by the hyper-parameters  $A$ ,  $B$ , and  $C$ .
- c) Regression function. This is an arbitrary function implemented with two fully connected layers and an output layer with linear activation, see **Fig. 4.11(c)**.

## 4.4.2. Training

### Base Model

The model in **Fig. 4.11** is trained using Adam algorithm with iterative early stop. The Adam optimizer is configured with the default settings presented in [109]:  $\alpha = 0.001$ ,  $\beta_1 = 0.9$ ,  $\beta_2 = 0.999$ , and  $\epsilon = 1e-8$ . The training cycle patience is 10 iterations, the optimizer is executed with early stop patience of 10 epochs, and mini-batch size of 512 samples. This is applied using the method described in **Algorithm 2** with  $N_I = 10$ ,  $N_E = 10$ ,  $B_{size} = 512$ .

The training results are illustrated in **Fig. 4.12(a)**. In this experiment, the initial and the final models achieve  $MSE = 0.0135 \text{ m}^2$  and  $MSE = 0.0122 \text{ m}^2$ , respectively. The  $MSE$  is calculated with the Euclidean distance (loss) between the expected and the predicted coordinates. The initial



Figure 4.10.: Spectrograms of sensors  $S_1$ ,  $S_2$  converted to grayscale for pulses at  $x = 0.105 \text{ m}$ ,  $y = 0.109 \text{ m}$  with noise disturbance.

model is obtained at the first early stop. In each stop, the moving averages of the Adam optimizer get initialized. This facilitates searching for better local minima. The model gets saved/updated when finding a better minimum.

The resulting model achieves  $MSE = 0.0122 \text{ m}^2$ , which corresponds to  $MAE = 0.0955 \text{ m}$ . See **Fig. 4.13(a)**. In total, the training takes 379 epochs in 25 stops. The first stop takes 43 epochs for the initial model and subsequent training iterations take an average of 14 epochs. The total time is 53 minutes using a PC with AMD Ryzen 5 5600H and NVIDIA GeForce RTX 3050.

### TensorFlow Lite 8-bit Quantization

This integer quantization is an optimization method that converts 32-bit FP numbers (such as weights and activations) to 8-bit fixed-point numbers. This quantization scheme allows inference to be carried out using integer-only arithmetic[106].

The base model is quantized using the TensorFlow Lite library with integer-only quantization settings. The filter and bias tensors are represented by 8-bit and 32-bit signed integers, respectively. The input and output activation tensors are represented by 8-bit signed integer. For convolution layers, this quantization includes two additional tensor coefficients (output-multiplier and output-shift). These tensors are the same shape as the bias tensor and represented by 32-bit signed integers as well.

The fixed-point model achieves  $MSE = 0.0122 \text{ m}^2$  and  $MAE = 0.0952 \text{ m}$ . See **Fig. 4.13(b)**.



Figure 4.11.: CNN-regression model for sensor analytics.

#### 4. Accelerating Convolutional Neural Networks

---

The MAE obtains a reduction of 0.31% compared to the base model. We attribute this to the regularization effect.

#### Quantization Aware Training for Hybrid-Float6

The QAT is a post-training step. We run this method during two epochs with mini-batch size of 10 samples with 4-bit exponent and 1-bit mantissa as parameters. This is applied using the method described in **Algorithm 3** with  $N_{ep} = 2$ ,  $B_{size} = 10$ ,  $E_{size} = 4$ ,  $M_{size} = 1$ .

The QAT is illustrated in **Fig. 4.12(b)**. First, the model gets quantized with HF6 format before starting QAT, this obtains  $MSE = 0.0188 \text{ m}^2$  and  $MAE = 0.1232 \text{ m}$ . This illustrates the inference of the base FP model (without QAT) on HF6 hardware. See **Fig. 4.13(c)**. Then, after QAT, the final model achieves  $MSE = 0.0112 \text{ m}^2$  and  $MAE = 0.0919 \text{ m}$ . This corresponds to an error reduction of 8.2% and 3.77%, respectively. See **Fig. 4.13(d)**. The QAT time is 185 minutes.



b) Quantization aware training



Figure 4.12.: Training results.



Figure 4.13.: Performance of the model with different data representations.

### Quantization Aware Training for Hybrid-Logarithmic 6-bit

For the sake of quality comparison, we generate the model with 6-bit logarithmic quantization on convolution parameters, see **Fig. 2.4(e)**. This quantization matches the bit size of HF6. We run the QAT on the base model with hybrid logarithmic parameters. Then, the filter and bias tensors of convolution layers are represented by 6-bit signed logarithmic. This is applied using the method described in **Algorithm 3** with  $N_{ep} = 2$ ,  $B_{size} = 10$ ,  $E_{size} = 5$ ,  $M_{size} = 0$ .

In this case, the model gets quality degradation. The model reaches  $MSE = 0.0123 \text{ m}^2$  and  $MAE = 0.0968 \text{ m}$ , which correspond to an error increase of 0.82% and 1.36%, respectively. See **Fig. 4.13(e)**.

A summary of inference with different data representations is presented in **Fig. 4.13(f)**.

### 4.4.3. Hardware Design Exploration

The proposed hardware/software co-design is demonstrated on the Zynq-7007S system-on-chip (SoC) on the MiniZed development board. This SoC integrates a single ARM Cortex-A9 processing system (PS) and a programmable logic (PL) equivalent to Xilinx Artix-7 (FPGA) in a single chip [90]. The Zynq-7007S SoC architecture maps the custom logic and software in the PL and PS, respectively.

In this platform, we implement the proposed hardware/software architecture to deploy the sensor analytics application. The desired model is converted to TensorFlow Lite (floating-point) and deployed on the SoC as a hex dump in a C array, this is used for the embedded software build. The Zynq-7007S SoC performs inference running TensorFlow Lite on the PS. The computational workload of convolution layers is delegated to the dedicated hardware.

#### Benchmark on Embedded CPU

We explore the performance of the embedded CPU for inference without hardware acceleration. In this case, TensorFlow Lite creates the CNN as a sequential model that allocates all computation to the CPU (ARM Cortex-A9) at 666 MHz and power dissipation of 1,187 W.

The compute performance and run-time inference of the CPU are shown in **Tab. 4.2(a)** and **Fig. 4.15(a)**, respectively.

### Benchmark on Tensor Processor with Standard Floating-Point Hardware using Xilinx LogiCORE IP

For this design, we implement the TP with standard FP hardware prior synthesis. The design parameters are:

- Max convolution kernel size:  $K_W = K_H = 3$ .
- Max input tensor width:  $W_I = 16$ .
- Max input and output channels:  $C_I = 55, C_O = 60$ .
- Filter and bias bit size:  $BitSize_F = BitSize_B = 32$ .
- Input tensor bit size:  $BitSize_I = 32$ .

Using equations from Section 4.3.2, the on-chip memory buffer utilization are  $Input_M = 84,480\text{b}$ ,  $Filter_M = 950,400\text{b}$ , and  $Bias_M = 1,920\text{b}$ . Hence, the required on-chip memory buffer size is  $TP_B = 1,036,800\text{b}$ .

The post-implementation resource utilization and power dissipation are presented in **Tab.** 4.1(a). The complete hardware platform utilizes 83% of BRAM, this includes the on-chip memory requirements of the TP, DMA, and AXI interconnects. The total available on-chip memory (BRAM) on the Zynq-7007S SoC is 1.8Mb. The estimated power dissipation of the TP is 85 mW at 200 MHz (this estimation is provided by Xilinx Vivado).

Table 4.1.: Resource utilization and power dissipation on the Zynq-7007S SoC.

| TP engine          | Post-implementation resource utilization |               |           |             | Power (W) |
|--------------------|------------------------------------------|---------------|-----------|-------------|-----------|
|                    | LUT                                      | FF            | DSP       | BRAM 36Kb   |           |
| (a) Floating-Point | 5,578<br>39%                             | 8,942<br>31%  | 23<br>35% | 41.5<br>83% | 1.429     |
| (b) Hybrid-Float6  | 7,313<br>51%                             | 10,330<br>36% | 20<br>30% | 15<br>30%   | 1.424     |

The compute performance and inference schedule of the model on this hardware implementation are shown in **Tab.** 4.2(b) and **Fig.** 4.15(b), respectively. In this implementation, TensorFlow Lite delegates computation of *Conv2D* tensor operations to the TP.

The implementation of the dot-product with standard FP engine (IEEE 754 arithmetic) utilizes proprietary multiplier and adder FP operator cores. Vivado HLS implements FP arithmetic operations by mapping them onto Xilinx LogiCORE IP cores, these FP operator cores are instantiated in the resultant RTL [91]. In this case, the implementation of the dot-product with the standard FP computation reuses the multiplier and adder cores in different compute

#### 4. Accelerating Convolutional Neural Networks

---

Table 4.2.: Compute performance of the CPU and TP on each Conv2D tensor operation. This table presents: tensor operation, computational cost in mega floating-point operations (MFLOP), latency, throughput, power efficiency, and estimated energy consumption as the energy delay product (EDP).

| Operation                                           | MFLOP | t (ms) | MFLOP/s | MFLOP/s/W | EDP (mJ) |
|-----------------------------------------------------|-------|--------|---------|-----------|----------|
| <b>a) CPU (ARM Cortex-A9) @666MHz, 1.187 W</b>      |       |        |         |           |          |
| Conv_A                                              | 0.691 | 112.24 | 6.16    | 5.19      | 133.23   |
| Conv_B                                              | 1.584 | 213.13 | 7.43    | 6.26      | 252.99   |
| Conv_C                                              | 0.475 | 46.59  | 10.20   | 8.59      | 55.31    |
| <b>b) TP (Floating-Point engine) @200MHz, 85 mW</b> |       |        |         |           |          |
| Conv_A                                              | 0.691 | 12.49  | 55.34   | 651.11    | 1.06     |
| Conv_B                                              | 1.584 | 16.39  | 96.66   | 1,137.20  | 1.39     |
| Conv_C                                              | 0.475 | 3.59   | 132.44  | 1,558.13  | 0.30     |
| <b>c) TP (Hybrid-Float6 engine) @200MHz, 84 mW</b>  |       |        |         |           |          |
| Conv_A                                              | 0.691 | 6.92   | 99.81   | 1,188.24  | 0.58     |
| Conv_B                                              | 1.584 | 4.41   | 358.94  | 4,273.09  | 0.37     |
| Conv_C                                              | 0.475 | 0.99   | 482.44  | 5,743.29  | 0.08     |

sections of the TP. The post-implementation resource utilization and power dissipation of the floating-point operator cores are shown in **Tab.** 4.3.

Table 4.3.: Resource utilization and power dissipation of multiplier and adder floating-point (IEEE 754) operator cores (Xilinx LogiCORE IP).

| Core operation | DSP | FF  | LUT | Latency (clk) | Power (mW) |
|----------------|-----|-----|-----|---------------|------------|
| Multiplier     | 3   | 151 | 325 | 4             | 7          |
| Adder          | 2   | 324 | 424 | 8             | 6          |

#### Tensor Processor with Hybrid-Float6 Hardware

To demonstrate the proposed design, the TP with HF6 hardware restates the standard FP design parameters with the following customization for the FP 6-bit quantization in filter and bias:  $BitSize_F = BitSize_B = 6$ .



Figure 4.14.: Inference acceleration and power reduction on the TP with floating-point and HF6 vs. CPU on the Zynq-7007S SoC.

Using equations from Section 4.3.2, the on-chip memory requirements are  $Input_M = 84,480\text{ b}$ ,  $Filter_M = 178,200\text{ b}$ ,  $Bias_M = 360\text{ b}$ . Hence, the required on-chip memory buffer size is  $TP_B = 263,040\text{ b}$ .

The post-implementation resource utilization and power dissipation are presented in **Tab. 4.1(b)**. The complete hardware platform utilizes 30% of BRAM, this includes the on-chip memory requirements of the TP, DMA, and AXI interconnects. The estimated power dissipation of the TP is 84 mW at 200 MHz (this estimation is provided by Xilinx Vivado).

The compute performance and inference schedule of the model on this hardware implementation are shown in **Tab. 4.2(c)** and **Fig. 4.15(c)**, respectively. **Fig. 4.14** presents a comparison of the acceleration and the reduction of power dissipation between standard FP and HF6 hardware implementations.

This deployment does not require model parameter extraction/treatment. The 6-bit FP representation is wrapped into the standard FP. The dedicated hardware design extracts the 6-bit format automatically and performs computation.

#### 4.4.4. Discussion

##### Training and Quantization

The training with iterative early stop obtains a model with enhanced accuracy than standard early stop. This method iteratively resets the moving averages of Adam’s optimizer, which helps to find better local minima. This iterative search is suitable for models with low computational cost.

The TensorFlow Lite 8-bit quantization preserves the overall model accuracy. In some cases, the associated regularization effect can improve the accuracy. However, the error distribution in CNN linear regressions gets slightly degraded. **Fig. 4.16(b)** shows this effect on three different models, where vertical and horizontal patterns appear in the error distribution of fixed-point quantization. We attribute this effect to the 8-bit resolution in the activation maps. In the case of HF6 quantization, the activation maps are represented by FP preventing this degradation.

The proposed 6-bit FP representation (E4M1) in convolution parameters improves latency, hardware area, and power dissipation, while preserving model accuracy. In our application, this number format produces better results than the 6-bit logarithmic representation (E5M0). This is demonstrated in **Fig. 4.13(d)** and **Fig. 4.13(e)**.

Applying HF6 quantization on ALL-CNN-C [110] produces an accuracy degradation of 1.39% and 0.11% with QAT. While applying 6-bit logarithmic produces a degradation of 11.18% and 7.22% with QAT.

#### 4. Accelerating Convolutional Neural Networks

---



Figure 4.15.: Run-time inference of TensorFlow Lite on the Zynq-7007S SoC. (a) CPU ARM Cortex-A9 at 666 MHz, (b) cooperative CPU + TP with floating-point Xilinx LogiCORE IP at 200 MHz, and (c) cooperative CPU + TP with Hybrid-Float6 at 200 MHz.

## Implementation and Performance

The proposed HF6 implementation reduces on-chip memory and DSP utilization while slightly increasing FF and LUT compared to the standard FP implementation. See **Tab. 4.1** and **Fig. 4.17**. The HF6 logic is implemented using FF and LUT, while the FP logic is implemented using Xilinx LogiCORE IPs with DSPs.

The compute performance of the CPU and TP on each convolution layer of the model is presented in **Tab. 4.2** and **Fig. 4.14**. The peak acceleration and power efficiency of the TP with standard FP is  $13\times$  and 1,558.13 MFLOPS/s/W, respectively. The peak acceleration and power efficiency of the TP with HF6 is  $48.3\times$  and 5,743.29 MFLOPS/s/W, respectively. The HF6 hardware demonstrates an improvement of  $3.7\times$  in acceleration and power efficiency with respect to the standard FP hardware. See **Fig. 4.14**. The estimated power dissipation on the SoC is presented in **Fig. 4.18**. This shows a very similar breakdown of power dissipation in both implementations. However, the energy efficiency is increased due to the reduced latency in HF6 hardware. A comparison of related work is presented in **Tab. 4.4**.



Figure 4.16.: 2D error distribution of three CNN-regression models.



Figure 4.17.: Hardware resource utilization on the Zynq-7007S SoC.

#### 4. Accelerating Convolutional Neural Networks

---

The run-time inference of TensorFlow Lite on the SoC is illustrated in **Fig. 4.15**. This shows the convolution layers as the compute-bound operations. The proposed embedded platform is a cooperative system where the convolution operations are delegated to the dedicated hardware accelerator. The ARM CPU obtains a latency of 387 ms (2.58 FPS). The platform with standard FP hardware obtains a latency of 48 ms (20.8 FPS), while the implementation with HF6 obtains a latency of 27.9 ms (35.84 FPS). These represent an overall acceleration of 8 $\times$  and 13.87 $\times$  over the CPU, respectively.

For ML compatibility/portability, the 6-bit FP is wrapped into the standard FP representation. The dedicated hardware design extracts the 6-bit format automatically and performs the computation.

### SoC Design and Compatibility

The proposed design is an alternative for low-power FP inference. The system runs as a cooperative hardware/software mechanism. This architecture delegates compute-bound tensor operations to a hardware accelerator.

The hybrid 32-bit FP and 6-bit FP quantization enables high quality of results and backward ML compatibility. Backwards ML compatibility gives portability from training to inference. This enables to run inference of HF6 quantized models on standard FP hardware and vice versa. Running inference of models without QAT allows rapid deployment; however, this will incur in accuracy degradation, see **Fig. 4.13(c)**.

### Future Work

To reduce energy consumption, activation maps can be represented by Bfloat16. This would reduce hardware resource utilization, memory footprint, and data transfer. To increase performance, this implementation would require matching computational throughput with memory



Figure 4.18.: Estimated power dissipation on the Zynq-7007S SoC with PS at 666 MHz and PL at 200 MHz.

Table 4.4.: Comparison of hardware implementation with related work.

| Platform                          | Chunsheng Mei et al. [100] | Chen Wu et al. [98] | BFP [101]         | Paolo Meloni et al. [104] | This work         |
|-----------------------------------|----------------------------|---------------------|-------------------|---------------------------|-------------------|
| Device                            | XC7VX690T                  | XC7K325T            | XC7VX690T         | XC7Z007S                  | XC7Z007S          |
| Year                              | 2017                       | 2019                | 2019              | 2019                      | 2019              |
| Dev. kit cost                     | \$7,494                    | \$1,299             | \$7,494           | \$89                      | \$89              |
| Format (activation/weight)        | FP 16-bit                  | FP 8-bit / 8-bit    | FP 16-bit / 8-bit | INT 16-bit                | FP 32-bit / 6-bit |
| Frequency (MHz)                   | 200                        | 200                 | 200               | 80                        | 200               |
| Peak power efficiency (GFLOP/s/W) | 18.72                      | 115.40              | 82.88             | 2.98                      | 5.7               |
| Peak throughput (GFLOP/s)         | 202.42                     | 1086.8              | 760.83            | 10.62                     | 0.4               |
| Wall plug power (W)               | 10.81                      | 9.42                | 9.18              | 2.5                       | 2.5               |
| BRAM 36Kb utilization             | 196.5                      | 234.5               | 913               | 44                        | 44                |
| DSP utilization                   | 1728                       | 768                 | 1027              | 54                        | 54                |

bandwidth using systolic arrays to replace the pipeline structure.

## 4.5. Conclusions

In this paper, we present the Hybrid-Float6 quantization for floating-point CNN hardware acceleration. Feature maps and weights are represented by 32-bit and 6-bit floating-point, respectively. The 6-bit floating-point format is composed of 1-bit sign, 4-bit exponent, and 1-bit mantissa. The 1-bit mantissa enables low-power multiply-accumulate implementations by reducing the mantissa multiplication to a multiplexer-adder operation. We exploit the intrinsic error tolerance of neural networks to further reduce the hardware design with approximation. This approach improves latency, hardware area, and energy consumption. To preserve accuracy, we introduce a quantization aware training method that, in some cases, improves accuracy. We present a lightweight tensor processor implementing a pipelined vector dot-product. For ML compatibility/portability, the 6-bit FP is wrapped in the standard floating-point format, which is automatically extracted by the proposed hardware. The hardware/software architecture is compatible with TensorFlow Lite. We evaluate the applicability of our approach with a CNN-regression model for anomaly localization in a structural health monitoring application based on acoustic emissions. The embedded hardware/software framework is demonstrated on XC7Z007S as the smallest Zynq-7000 SoC. The proposed hardware achieves a peak power efficiency and acceleration on convolution layers of 5.7 GFLOPS/s/W and 48.3 $\times$ , respectively.



---

## 5. Conclusion and Outlook

---

|                                                |           |
|------------------------------------------------|-----------|
| <b>5.1. Summary of Contributions . . . . .</b> | <b>80</b> |
| <b>5.2. Future Works . . . . .</b>             | <b>80</b> |

---

The use of AI is entering a new era based on the use of ubiquitous embedded connected devices. The sustainability of this transformation requires the adoption of design techniques that reconcile accurate results with cost-effective system architectures. As such, improving the efficiency of AI hardware engines as well as ML portability must be considered.

In the emerging era of Industry 4.0, ML algorithms yield the power of AI to massively ubiquitous IoT devices. Applications in this field become smarter and more profitable as the availability of big data gets expanded, driving evolution of many aspects in science, industry, and daily life. However, state-of-the-art ML algorithms, specially SNN and CNN, represent elevated computational and energy costs. Therefore, hardware efficiency is one of the major goals to innovate compute engines as they are the machinery of the future.

Energy, performance, and chip-area are the key design concerns in computer systems. Considering the intrinsic error resilience of ML algorithms, paradigms such as approximate computing come to the rescue by offering promising efficiency gains to assist the aforementioned design concerns. Approximation techniques are widely used in ML algorithms at the model-structure as well as at the hardware processing level. However, state-of-the-art methods do not sufficiently address accelerator designs for ANN, in particular with FP computation.

To sustain the continuous expansion of ML applications on cost-effective compute devices, approximate computing will gradually transform from a design alternative to an essential prerequisite. This dissertation focuses on the investigation of design methodologies to exploit the intrinsic error resilience of ML algorithms to optimize FP inference in low-power embedded systems.

## 5.1. Summary of Contributions

In the field of SNN, this dissertation presents a hardware design methodology for low-power inference of SbS neural networks targeting embedded applications. This ML algorithm provides exceptional noise robustness and reduced complexity compared to conventional SNN with LIF mechanism. However, SbS networks represent a memory footprint and a computational cost unsuitable for embedded applications. To address this problem, this work exploits the intrinsic error resilience of SbS to improve performance and to reduce hardware complexity. More precisely, we design a vector dot-product module based on approximate computing with configurable quality using hybrid custom FP and logarithmic number representations. This approach reduces computational run-time, memory footprint, and power dissipation while preserving inference accuracy. To demonstrate this approach, we address a design exploration flow with HLS on a FPGA. The proposed design reduces  $20.5\times$  run-time and  $8\times$  weight memory footprint, with less than 0.5% of accuracy degradation without retraining on a handwritten digit classification task.

In the field of CNN, this dissertation presents a hardware design methodology for low-power inference targeting sensor analytics applications. In this work, we present the HF6 quantization and its dedicated hardware processor. We propose an optimized FP MAC hardware by reducing the mantissa multiplication to a multiplexer-adder operation. We exploit the intrinsic error tolerance of neural networks to further reduce the hardware design with approximation on the subnormal number computation. To preserve model accuracy, we present a QAT method, which in some cases improves accuracy. We demonstrate this concept in 2D convolution layers. We present a lightweight TP implementing a pipelined vector dot-product. For ML portability, the custom FP representation is wrapped in the standard format, which is automatically extracted by the proposed hardware. The hardware/software architecture is integrated with TF Lite. We evaluate the applicability of our approach with a CNN-regression model for anomaly localization in a SHM application based on AE. The embedded hardware/software framework is demonstrated on XC7Z007S as the smallest Zynq-7000 SoC. The proposed implementation achieves a peak power efficiency and acceleration of 5.7 GFLOPS/s/W and  $48.3\times$ , respectively.

The outcome of this dissertation aims to contribute to the rise of a sustainable next generation of low-power FP neural network processors with ML portability as a design philosophy.

## 5.2. Future Works

---

# A. Appendix

## A.1. SbS algorithm

The SbS network inference is described in **Algorithm 5**, while spike production and layer update are described in **Algorithm 6** and **Algorithm 7**, respectably.

---

**Algorithm 5:** SbS network inference.

---

**input:** Layers of the network as  $H^l$ , where  
 $l$  is the layer index.  
**input:**  $N_L$  as the number of layers.  
**input:**  $N_X^l, N_Y^l$  as the size of layers.  
**input:**  $N_{Spk}$  as the number of spikes for inference (iterations).  
**output:**  $H^l$ .

**for**  $t = 0$  **to**  $N_{Spk} - 1$  **do**

*Initialization of  $H^l(i_X, i_Y, :)$  :*

**if**  $t == 0$  **then**

**for**  $l = 0$  **to**  $N_L - 1$  **do**

**for**  $i_X = 0, i_Y = 0$  **to**  $N_X^l - 1, N_Y^l - 1$  **do**

**for**  $i_H = 0$  **to**  $N_H^l - 1$  **do**

$H^l(i_X, i_Y, i_H) = 1/N_H^l$

**end for**

**end for**

**end if**

*Production of spikes :*

**for**  $l = 0$  **to**  $N_L - 1$  **do**

**if**  $l == 0$  **then**

Draw spikes from input // (Algorithm 6)

**else**

Draw spikes from  $H^l$  // (Algorithm 6)

**end if**

**end for**

*Update layers :*

**for**  $l = 0$  **to**  $N_L - 1$  **do**

Update  $H^l$  // (Algorithm 7)

**end for**

**end for**

---

---

**Algorithm 6:** Spike production.

**input:** Layer as  $H_t \in \mathbb{R}^{N_X \times N_Y \times N_H}$ , where  
 $N_X$  is the layer width,  
 $N_Y$  is the layer height  
 $N_H$  is the length of  $\vec{h}$  (IP vector).  
**output:** Output spikes as  $S_t^{out} \in \mathbb{N}^{N_X \times N_Y}$

```

1: for  $i_X = 0, i_Y = 0$  to  $N_X - 1, N_Y - 1$  do
2:   Generate spike :
3:    $th = MT19937PseudoRandom() / (2^{32} - 1)$ 
4:    $acu = 0$ 
5:   for  $i_H = 0$  to  $N_H - 1$  do
6:      $acu = acu + H_t(i_X, i_Y, i_H)$ 
7:     if  $th \leq acu$  or  $i_H == N_H - 1$  then
8:        $S_t^{out}(i_X, i_Y) = i_H$ 
9:     end if
10:   end for
11: end for

```

---

---

## A. Appendix

---



---

**Algorithm 7:** SbS layer update.

---

**input:** Layer as  $H \in \mathbb{R}^{N_X \times N_Y \times N_H}$ , where

$N_X$  is the layer width,

$N_Y$  is the layer height

$N_H$  is the length of  $\vec{h}$  (IP vector).

**input:** Synaptic matrix as  $W \in \mathbb{R}^{K_X \times K_Y \times M_H \times N_H}$ , where

$K_X \times K_Y$  is the size of the convolution/pooling kernel,

$M_H$  is the length of  $\vec{h}$  from previous layer,

$N_H$  is the length of  $\vec{h}$  from this layer.

**input:** Input spike matrix from previous layer as  $S_t^{in} \in \mathbb{N}^{N_{Xin} \times N_{Yin}}$ , where

$N_{Xin}$  is the width of the previous layer,

$N_{Yin}$  is the height of the previous layer.

**input:** Strides of X and Y as  $stride_X$  and  $stride_Y$ , respectively.

**input:** Epsilon as  $\epsilon \in \mathbb{R}$ .

**output:** Updated layer as  $H^{new} \in \mathbb{R}^{N_X \times N_Y \times N_H}$ .

*Update layer :*

```

1:  $z_X = 0 //$  X and Y index for  $S_t^{in}$ 
2:  $z_Y = 0$ 
3: for  $i_Y = 0$  to  $N_Y - 1$  do
4:   for  $i_X = 0$  to  $N_X - 1$  do
5:      $\vec{h} = H(i_X, i_Y, :)$ 
      Update IP :
6:     for  $j_X = 0, j_Y = 0$  to  $K_X - 1, K_Y - 1$  do
7:        $s_t = S_t^{in}(z_X + j_X, z_Y + j_Y)$ 
8:        $\vec{w} = W(j_X, j_Y, s_t, :)$ 
9:        $\vec{p} = 0$ 
        Dot-product :
10:       $r = 0$ 
11:      for  $j_H = 0$  to  $N_H - 1$  do
12:         $\vec{p}(j_H) = \vec{h}(j_H)\vec{w}(j_H)$ 
13:         $r = r + \vec{p}(j_H)$ 
14:      end for
15:      if  $r \neq 0$  then
16:        Update IP vector :
17:        for  $i_H = 0$  to  $N_H - 1$  do
18:           $h^{new}(i_H) = \frac{1}{1+\epsilon} \left( h(i_H) + \epsilon \frac{\vec{p}(i_H)}{r} \right)$ 
19:        end for
        Set the new H vector for the layer :
20:         $H^{new}(i_X, i_Y, :) = \vec{h}^{new}$ 
21:      end if
22:    end for
23:     $z_X = z_X + stride_X$ 
24:  end for
25:   $z_Y = z_Y + stride_Y$ 
26: end for

```

---

## Acronyms



# List of Figures

|      |                                                                                                                                                                                                                                                                                                                       |    |
|------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----|
| 2.1. | SbS network architecture for handwritten digit classification task. . . . .                                                                                                                                                                                                                                           | 13 |
| 2.2. | SbS IPs as independent computational entities, (a) illustrates an input layer with a massive amount of IPs operating as independent computational entities, (b) shows a hidden layer with an arbitrary amount of IPs as independent computational entities, (c) exhibits a set of neurons grouped in an IP. . . . .   | 13 |
| 2.3. | (a) Performance classification of SbS NN versus equivalent CNN, and (b) example of the first pattern in the MNIST test data set with different amounts of positive additive uniformly distributed noise. . . . .                                                                                                      | 14 |
| 2.4. | Floating-point number representation. . . . .                                                                                                                                                                                                                                                                         | 16 |
| 3.1. | Dot-product hardware module with (a) standard floating-point (IEEE 754) arithmetic, (b) hybrid custom floating-point approximation, and (c) hybrid logarithmic approximation. . . . .                                                                                                                                 | 22 |
| 3.2. | System-level overview of the embedded software architecture. . . . .                                                                                                                                                                                                                                                  | 26 |
| 3.3. | System-level hardware architecture with scalable number of heterogeneous PUs: <i>Spike</i> , <i>Conv</i> , <i>Pool</i> , and <i>FC</i> . . . . .                                                                                                                                                                      | 26 |
| 3.4. | The <i>Conv</i> processing unit and its six stages: (a) receive IP vector, (b) spike firing, (c) receive spike kernel, (d) update dynamics, (e) dispatch new IP vector, (f) dispatch output spike matrix. . . . .                                                                                                     | 28 |
| 3.5. | Dot-product hardware module with standard floating-point (IEEE 754) computation, (a) exhibits the initiation interval of 10 clock cycles, (b) presents the iteration latency of 19 clock cycles, (c) shows the pairwise product block in dark-gray, and (d) illustrates the accumulation block in light-gray. . . . . | 31 |
| 3.6. | Dot-product hardware module with hybrid custom floating-point approximation, (a) exhibits the initiation interval of 2 clock cycles, (b) presents the iteration latency of 13 clock cycles, (c) shows the pairwise product blocks in dark-gray, and (d) illustrates the accumulation blocks in light-gray. . . . .    | 33 |

|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |    |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----|
| 3.7. Dot-product hardware module with hybrid logarithmic approximation, (a) exhibits the initiation interval of 2 clock cycles, (b) presents the iteration latency of 9 clock cycles, (c) shows the pairwise product block in dark-gray, and (d) illustrates the accumulation blocks in light-gray. . . . .                                                                                                                                                                                                                                                                                                                                   | 34 |
| 3.8. Computation on embedded CPU. . . . .                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | 35 |
| 3.9. System overview of the top-level architecture with 8 processing units. . . . .                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           | 35 |
| 3.10. Performance of processing units with standard floating-point (IEEE 754) computation. . . . .                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | 36 |
| 3.11. Performance bottleneck of cyclic computation on processing units with standard floating-point (IEEE 754) arithmetic, (a) exhibits the starting of $t_{PU}$ of <i>Conv2</i> on a previous computation cycle, (b) presents $t_{CPU}$ of <i>Conv2</i> on the current computation cycle, (c) shows the CPU waiting time (in gray color) for <i>Conv2</i> as a busy resource (awaiting for <i>Conv2</i> interruption), and (d) illustrates the $t_f$ from the previous computation cycle, the starting of $t_{PU}$ on the current computation cycle ( <i>Conv2</i> interruption on completion, and start current computation cycle). . . . . | 37 |
| 3.12. Noise tolerance on hardware PU with standard floating-point (IEEE 754) computation (benchmark/reference), (a) exhibits accuracy degradation applying 50% of noise amplitude, and (b) illustrates convergence of inference with 400 spikes. . . . .                                                                                                                                                                                                                                                                                                                                                                                      | 40 |
| 3.13. $\log_2$ -histogram of each synaptic weight matrix showing the percentage of matrix elements with given integer exponent. . . . .                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       | 41 |
| 3.14. Performance on processing units with hybrid custom floating-point approximation, (a) exhibits computation schedule, (b) presents cyclic computation schedule, and (c) shows the performance of <i>Conv2</i> from a previous computation cycle during the preprocessing of <i>H1_CONV</i> on the current computation cycle without bottleneck. . . . .                                                                                                                                                                                                                                                                                   | 42 |
| 3.15. Noise tolerance on hardware PU with custom floating-point approximation, (a) exhibits accuracy degradation applying 50% of noise amplitude, and (b) illustrates convergence of inference with 400 spikes. . . . .                                                                                                                                                                                                                                                                                                                                                                                                                       | 44 |
| 3.16. Performance of processing units with hybrid logarithmic approximation, (a) exhibits computation schedule, and (b) illustrates cyclic computation schedule. . . . .                                                                                                                                                                                                                                                                                                                                                                                                                                                                      | 45 |
| 3.17. Noise tolerance on hardware PU with hybrid logarithmic approximation, (a) exhibits accuracy degradation applying 40% of noise amplitude, (b) illustrates convergence of inference with 600 spikes. . . . .                                                                                                                                                                                                                                                                                                                                                                                                                              | 46 |

|                                                                                                                                                                                                                                                                                                                                                                                                                                         |    |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----|
| 3.18. Power dissipation breakdown of platform implementations, (a) ?? architecture with homogeneous AUs using standard floating-point arithmetic (IEEE 754), (b) reference architecture with specialized heterogeneous PUs using standard floating-point arithmetic (IEEE 754), (c) proposed architecture with hybrid custom floating-point approximation, and (d) proposed architecture with hybrid logarithmic approximation. . . . . | 47 |
| 4.1. The workflow of our approach on embedded FPGAs. . . . .                                                                                                                                                                                                                                                                                                                                                                            | 51 |
| 4.2. Base embedded system architecture. . . . .                                                                                                                                                                                                                                                                                                                                                                                         | 54 |
| 4.3. Hardware architecture of the proposed tensor processor. . . . .                                                                                                                                                                                                                                                                                                                                                                    | 54 |
| 4.4. Dot-product hardware module with (a) standard floating-point and (b) Hybrid-Float6. . . . .                                                                                                                                                                                                                                                                                                                                        | 56 |
| 4.5. (a) Dot-product hardware module with Hybrid-Float6 MAC, (b) bias accumulation, (c) activation and normalization. . . . .                                                                                                                                                                                                                                                                                                           | 56 |
| 4.6. Hybrid-Float6 multiply-accumulate hardware design. . . . .                                                                                                                                                                                                                                                                                                                                                                         | 57 |
| 4.7. Design parameters for on-chip memory buffers on the TP. . . . .                                                                                                                                                                                                                                                                                                                                                                    | 58 |
| 4.8. Base embedded software architecture. . . . .                                                                                                                                                                                                                                                                                                                                                                                       | 61 |
| 4.9. Experimental setup for sensor analytics on structural health monitoring, all lengths are in meters (m). . . . .                                                                                                                                                                                                                                                                                                                    | 64 |
| 4.10. Spectrograms of sensors $S_1, S_2$ converted to grayscale for pulses at $x = 0.105$ m, $y = 0.109$ m with noise disturbance. . . . .                                                                                                                                                                                                                                                                                              | 66 |
| 4.11. CNN-regression model for sensor analytics. . . . .                                                                                                                                                                                                                                                                                                                                                                                | 67 |
| 4.12. Training results. . . . .                                                                                                                                                                                                                                                                                                                                                                                                         | 68 |
| 4.13. Performance of the model with different data representations. . . . .                                                                                                                                                                                                                                                                                                                                                             | 69 |
| 4.14. Inference acceleration and power reduction on the TP with floating-point and HF6 vs. CPU on the Zynq-7007S SoC. . . . .                                                                                                                                                                                                                                                                                                           | 72 |
| 4.15. Run-time inference of TensorFlow Lite on the Zynq-7007S SoC. (a) CPU ARM Cortex-A9 at 666 MHz, (b) cooperative CPU + TP with floating-point Xilinx LogiCORE IP at 200 MHz, and (c) cooperative CPU + TP with Hybrid-Float6 at 200 MHz. . . . .                                                                                                                                                                                    | 74 |
| 4.16. 2D error distribution of three CNN-regression models. . . . .                                                                                                                                                                                                                                                                                                                                                                     | 75 |
| 4.17. Hardware resource utilization on the Zynq-7007S SoC. . . . .                                                                                                                                                                                                                                                                                                                                                                      | 75 |
| 4.18. Estimated power dissipation on the Zynq-7007S SoC with PS at 666 MHz and PL at 200 MHz. . . . .                                                                                                                                                                                                                                                                                                                                   | 76 |



# List of Tables

|       |                                                                                                                                                                                                                                                                                                   |    |
|-------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----|
| 2.1.  | SbS network architecture for handwritten digit classification task. . . . .                                                                                                                                                                                                                       | 14 |
| 3.1.  | Computation on embedded CPU. . . . .                                                                                                                                                                                                                                                              | 35 |
| 3.2.  | Performance of processing units with standard floating-point (IEEE 754) computation. . . . .                                                                                                                                                                                                      | 36 |
| 3.3.  | Resource utilization and power dissipation of processing units with standard floating-point (IEEE 754) computation. . . . .                                                                                                                                                                       | 38 |
| 3.4.  | Resource utilization and power dissipation of multiplier and adder floating-point (IEEE 754) operator cores. . . . .                                                                                                                                                                              | 38 |
| 3.5.  | Resource utilization and power dissipation of processing units with hybrid custom floating-point approximation. . . . .                                                                                                                                                                           | 42 |
| 3.6.  | Performance of hardware processing units with hybrid custom floating-point approximation. . . . .                                                                                                                                                                                                 | 43 |
| 3.7.  | Performance of hardware processing units with hybrid logarithmic approximation. . . . .                                                                                                                                                                                                           | 43 |
| 3.8.  | Resource utilization and power dissipation of processing units with hybrid logarithmic approximation. . . . .                                                                                                                                                                                     | 44 |
| 3.9.  | Experimental results. . . . .                                                                                                                                                                                                                                                                     | 47 |
| 3.10. | Platform implementations. . . . .                                                                                                                                                                                                                                                                 | 47 |
| 4.1.  | Resource utilization and power dissipation on the Zynq-7007S SoC. . . . .                                                                                                                                                                                                                         | 71 |
| 4.2.  | Compute performance of the CPU and TP on each Conv2D tensor operation.<br>This table presents: tensor operation, computational cost in mega floating-point operations (MFLOP), latency, throughput, power efficiency, and estimated energy consumption as the energy delay product (EDP). . . . . | 72 |
| 4.3.  | Resource utilization and power dissipation of multiplier and adder floating-point (IEEE 754) operator cores (Xilinx LogiCORE IP). . . . .                                                                                                                                                         | 72 |
| 4.4.  | Comparison of hardware implementation with related work. . . . .                                                                                                                                                                                                                                  | 77 |



# Bibliography

- [1] Heiner Lasi, Peter Fettke, Hans-Georg Kemper, Thomas Feld, and Michael Hoffmann. Industry 4.0. *Business & information systems engineering*, 6(4):239–242, 2014.
- [2] Héctor Espinoza, Gerhard Kling, Frank McGroarty, Mary O’Mahony, and Xenia Ziouvelou. Estimating the impact of the internet of things on productivity in europe. *Heliyon*, 6(5):e03935, 2020.
- [3] Vitor Alcácer and Virgilio Cruz-Machado. Scanning the industry 4.0: A literature review on technologies for manufacturing systems. *Engineering science and technology, an international journal*, 22(3):899–919, 2019.
- [4] Kou-Hung Lawrence Loh. 1.2 fertilizing aiot from roots to leaves. In *2020 IEEE International Solid-State Circuits Conference-(ISSCC)*, pages 15–21. IEEE, 2020.
- [5] Jing Zhang and Dacheng Tao. Empowering things with intelligence: A survey of the progress, challenges, and opportunities in artificial intelligence of things. *IEEE Internet of Things Journal*, 2020.
- [6] Vinay K Chippa, Srimat T Chakradhar, Kaushik Roy, and Anand Raghunathan. Analysis and characterization of inherent application resilience for approximate computing. In *Proceedings of the 50th Annual Design Automation Conference*, pages 1–9, 2013.
- [7] Swagath Venkataramani, Kaushik Roy, and Anand Raghunathan. Efficient embedded learning for iot devices. In *2016 21st Asia and South Pacific Design Automation Conference (ASP-DAC)*, pages 308–311. IEEE, 2016.
- [8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. *IEEE transactions on pattern analysis and machine intelligence*, 37(9):1904–1916, 2015.
- [9] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In *European conference on computer vision*, pages 21–37. Springer, 2016.

## Bibliography

---

- [10] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: towards real-time object detection with region proposal networks. *IEEE transactions on pattern analysis and machine intelligence*, 39(6):1137–1149, 2016.
- [11] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. Yolov4: Optimal speed and accuracy of object detection. *arXiv preprint arXiv:2004.10934*, 2020.
- [12] Ijaz Ahmad, Shahriar Shahabuddin, Tanesh Kumar, Erkki Harjula, Marcus Meisel, Markku Juntti, Thilo Sauter, and Mika Ylianttila. Challenges of ai in wireless networks for iot. *arXiv preprint arXiv:2007.04705*, 2020.
- [13] Fadi Al-Turjman. *Artificial intelligence in IoT*. Springer, 2019.
- [14] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In *Proceedings of the 44th annual international symposium on computer architecture*, pages 1–12, 2017.
- [15] Hussam Amrouch, Georgios Zervakis, Sami Salamin, Hammam Kattan, Iraklis Anagnos-topoulos, and Jörg Henkel. Npu thermal management. *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, 39(11):3842–3855, 2020.
- [16] Syed Ghayoor Abbas Gillani. Exploiting error resilience for hardware efficiency: targeting iterative and accumulation based algorithms. 2020.
- [17] Jie Han and Michael Orshansky. Approximate computing: An emerging paradigm for energy-efficient design. In *2013 18th IEEE European Test Symposium (ETS)*, pages 1–6. IEEE, 2013.
- [18] Maxence Bouvier, Alexandre Valentian, Thomas Mesquida, François Rummens, Marina Reyboz, Elisa Vianello, and Edith Beigne. Spiking neural networks hardware implementations and challenges: A survey. *ACM Journal on Emerging Technologies in Computing Systems (JETC)*, 15(2):1–35, 2019.
- [19] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In *Advances in neural information processing systems*, pages 3123–3131, 2015.
- [20] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. *arXiv preprint arXiv:1510.00149*, 2015.

- [21] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. *The Journal of Machine Learning Research*, 18(1):6869–6898, 2017.
- [22] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In *European conference on computer vision*, pages 525–542. Springer, 2016.
- [23] Yann LeCun, John Denker, and Sara Solla. Optimal brain damage. *Advances in neural information processing systems*, 2:598–605, 1989.
- [24] Babak Hassibi and David Stork. Second order derivatives for network pruning: Optimal brain surgeon. *Advances in neural information processing systems*, 5:164–171, 1992.
- [25] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. *arXiv preprint arXiv:1611.06440*, 2016.
- [26] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. *arXiv preprint arXiv:1608.08710*, 2016.
- [27] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. *arXiv preprint arXiv:1810.05270*, 2018.
- [28] Qian Zhang, Ting Wang, Ye Tian, Feng Yuan, and Qiang Xu. Approxann: An approximate computing framework for artificial neural network. In *2015 Design, Automation & Test in Europe Conference & Exhibition (DATE)*, pages 701–706. IEEE, 2015.
- [29] Nicholas P Carter, Helia Naeimi, and Donald S Gardner. Design techniques for cross-layer resilience. In *2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010)*, pages 1023–1028. IEEE, 2010.
- [30] Uroš Lotrič and Patricio Bulić. Applicability of approximate multipliers in hardware neural networks. *Neurocomputing*, 96:57–65, 2012.
- [31] Zidong Du, Krishna Palem, Avinash Lingamneni, Olivier Temam, Yunji Chen, and Chengyong Wu. Leveraging the error resilience of machine-learning applications for designing highly energy efficient accelerators. In *2014 19th Asia and South Pacific design automation conference (ASP-DAC)*, pages 201–206. IEEE, 2014.

## Bibliography

---

- [32] Vojtech Mrazek, Syed Shakib Sarwar, Lukas Sekanina, Zdenek Vasicek, and Kaushik Roy. Design of power-efficient approximate multipliers for approximate artificial neural networks. In *Proceedings of the 35th International Conference on Computer-Aided Design*, pages 1–7, 2016.
- [33] Syed Shakib Sarwar, Swagath Venkataramani, Anand Raghunathan, and Kaushik Roy. Multiplier-less artificial neurons exploiting error resiliency for energy-efficient neural computing. In *2016 Design, Automation & Test in Europe Conference & Exhibition (DATE)*, pages 145–150. IEEE, 2016.
- [34] Georgios Zervakis, Hassaan Saadat, Hussam Amrouch, Andreas Gerstlauer, Sri Parameswaran, and Jörg Henkel. Approximate computing for ml: State-of-the-art, challenges and visions. In *Proceedings of the 26th Asia and South Pacific Design Automation Conference*, pages 189–196, 2021.
- [35] Mark D McDonnell and Lawrence M Ward. The benefits of noise in neural systems: bridging theory and experiment. *Nature Reviews Neuroscience*, 12(7):415–425, 2011.
- [36] Udo Ernst, David Rotermund, and Klaus Pawelzik. Efficient computation based on stochastic spikes. *Neural computation*, 19(5):1313–1343, 2007.
- [37] Joel Dapello, Tiago Marques, Martin Schrimpf, Franziska Geiger, David D. Cox, and James J. DiCarlo. Simulating a primary visual cortex at the front of cnns improves robustness to image perturbations. *bioRxiv*, 2020.
- [38] David Rotermund and Klaus R. Pawelzik. Back-propagation learning in deep spike-by-spike networks. *Frontiers in Computational Neuroscience*, 13:55, 2019.
- [39] David Rotermund and Klaus R. Pawelzik. Massively parallel FPGA hardware for spike-by-spike networks. *bioRxiv*, 2019.
- [40] Kaushik Roy, Akhilesh Jaiswal, and Priyadarshini Panda. Towards spike-based machine intelligence with neuromorphic computing. *Nature*, 575(7784):607–617, 2019.
- [41] Aaron R Young, Mark E Dean, James S Plank, and Garrett S Rose. A review of spiking neuromorphic hardware communication systems. *IEEE Access*, 7:135606–135620, 2019.
- [42] F. Akopyan, J. Sawada, A. Cassidy, R. Alvarez-Icaza, J. Arthur, P. Merolla, N. Imam, Y. Nakamura, P. Datta, G. Nam, B. Taba, M. Beakes, B. Brezzo, J. B. Kuang, R. Manohar, W. P. Risk, B. Jackson, and D. S. Modha. Truenorth: Design and tool flow of a 65 mw

- 1 million neuron programmable neurosynaptic chip. *IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems*, 34(10):1537–1557, Oct 2015.
- [43] E. Painkras, L. A. Plana, J. Garside, S. Temple, F. Galluppi, C. Patterson, D. R. Lester, A. D. Brown, and S. B. Furber. Spinnaker: A 1-w 18-core system-on-chip for massively-parallel neural network simulation. *IEEE Journal of Solid-State Circuits*, 48(8):1943–1953, Aug 2013.
- [44] Mike Davies, Narayan Srinivasa, Tsung-Han Lin, Gautham Chinya, Yongqiang Cao, Sri Harsha Choday, Georgios Dimou, Prasad Joshi, Nabil Imam, Shweta Jain, et al. Loihi: A neuromorphic manycore processor with on-chip learning. *IEEE Micro*, 38(1):82–99, 2018.
- [45] Yarib Nevarez, Alberto Garcia-Ortiz, David Rotermund, and Klaus R Pawelzik. Accelerator framework of spike-by-spike neural networks for inference and incremental learning in embedded systems. In *2020 9th International Conference on Modern Circuits and Systems Technologies (MOCAST)*, pages 1–5. IEEE, 2020.
- [46] David Rotermund and Klaus R. Pawelzik. Biologically plausible learning in a deep recurrent spiking network. *bioRxiv*, 2019.
- [47] Peter Dayan and Laurence F Abbott. *Theoretical neuroscience: computational and mathematical modeling of neural systems*. Computational Neuroscience Series, 2001.
- [48] Guoqiang Li, Chao Deng, Jun Wu, Xuebing Xu, Xinyu Shao, and Yuanhang Wang. Sensor data-driven bearing fault diagnosis based on deep convolutional neural networks and s-transform. *Sensors*, 19(12):2750, 2019.
- [49] Fei Dong, Xiao Yu, Enjie Ding, Shoupeng Wu, Chunyang Fan, and Yanqiu Huang. Rolling bearing fault diagnosis using modified neighborhood preserving embedding and maximal overlap discrete wavelet packet transform with sensitive features selection. *Shock and Vibration*, 2018, 2018.
- [50] Tomonori Nagayama and Billie F Spencer Jr. Structural health monitoring using smart sensors. Technical report, Newmark Structural Engineering Laboratory. University of Illinois at Urbana, 2007.
- [51] Jindong Wang, Yiqiang Chen, Shuji Hao, Xiaohui Peng, and Lisha Hu. Deep learning for sensor-based activity recognition: A survey. *Pattern Recognition Letters*, 119:3–11, 2019.

## Bibliography

---

- [52] Yong Chan Kim, Hyeong-Geun Yu, Jae-Hoon Lee, Dong-Jo Park, and Hyun-Woo Nam. Hazardous gas detection for ftir-based hyperspectral imaging system using dnn and cnn. In *Electro-Optical and Infrared Systems: Technology and Applications XIV*, volume 10433, page 1043317. International Society for Optics and Photonics, 2017.
- [53] Zhouhan Lin, Matthieu Courbariaux, Roland Memisevic, and Yoshua Bengio. Neural networks with few multiplications. *arXiv preprint arXiv:1510.03009*, 2015.
- [54] Philip Colangelo, Nasibeh Nasiri, Eriko Nurvitadhi, Asit Mishra, Martin Margala, and Kevin Nealis. Exploration of low numeric precision deep learning inference using intel® fpgas. In *2018 IEEE 26th annual international symposium on field-programmable custom computing machines (FCCM)*, pages 73–80. IEEE, 2018.
- [55] Julian Faraone, Martin Kumm, Martin Hardieck, Peter Zipf, Xueyuan Liu, David Boland, and Philip HW Leong. Addnet: Deep neural networks using fpga-optimized multipliers. *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, 28(1):115–128, 2019.
- [56] Yarib Nevarez, David Rotermund, Klaus R Pawelzik, and Alberto Garcia-Ortiz. Accelerating spike-by-spike neural networks on fpga with hybrid custom floating-point and logarithmic dot-product approximation. *IEEE Access*, 2021.
- [57] Wanli Yu, Ardalan Najafi, Yarib Nevarez, Yanqiu Huang, and Alberto Garcia-Ortiz. Taac: Task allocation meets approximate computing for internet of things. In *2020 IEEE International Symposium on Circuits and Systems (ISCAS)*, pages 1–5. IEEE, 2020.
- [58] Eugene M Izhikevich. Which model to use for cortical spiking neurons? *IEEE transactions on neural networks*, 15(5):1063–1070, 2004.
- [59] Yann LeCun. The mnist database of handwritten digits. <http://yann.lecun.com/exdb/mnist/>, 1998.
- [60] Jiuxiang Gu, Zhenhua Wang, Jason Kuen, Liyang Ma, Amir Shahroudy, Bing Shuai, Ting Liu, Xingxing Wang, Gang Wang, Jianfei Cai, et al. Recent advances in convolutional neural networks. *Pattern Recognition*, 77:354–377, 2018.
- [61] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. *Deep learning*. MIT press, 2016.
- [62] Dan Zuras, Mike Cowlishaw, Alex Aiken, Matthew Applegate, David Bailey, Steve Bass, Dileep Bhandarkar, Mahesh Bhat, David Bindel, Sylvie Boldo, et al. Ieee standard for floating-point arithmetic. *IEEE Std*, 754(2008):1–70, 2008.

- [63] Jürgen Schmidhuber. Deep learning in neural networks: An overview. *Neural networks*, 61:85–117, 2015.
- [64] Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. Deepface: Closing the gap to human-level performance in face verification. In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, June 2014.
- [65] Nassim Abderrahmane, Edgar Lemaire, and BenoÃ®t Miramond. Design space exploration of hardware spiking neurons for embedded artificial intelligence. *Neural Networks*, 121:366 – 386, 2020.
- [66] Maxence Bouvier, Alexandre Valentian, Thomas Mesquida, Francois Rummens, Marina Reyboz, Elisa Vianello, and Edith Beigne. Spiking neural networks hardware implementations and challenges: A survey. *J. Emerg. Technol. Comput. Syst.*, 15(2), April 2019.
- [67] Katrin Amunts, Alois C Knoll, Thomas Lippert, Cyriel MA Pennartz, Philippe Ryvlin, Alain Destexhe, Viktor K Jirsa, Egidio D’Angelo, and Jan G Bjaalie. The human brain project – synergy between neuroscience, computing, informatics, and brain-inspired technologies. *PLoS biology*, 17(7):e3000344, 2019.
- [68] Ming ZHANG, GU Zonghua, and PAN Gang. A survey of neuromorphic computing based on spiking neural networks. *Chinese Journal of Electronics*, 27(4):667–674, 2018.
- [69] Jongsun Park, Jung Hwan Choi, and Kaushik Roy. Dynamic bit-width adaptation in dct: An approach to trade off image quality and computation energy. *IEEE transactions on very large scale integration (VLSI) systems*, 18(5):787–793, 2009.
- [70] Vaibhav Gupta, Debabrata Mohapatra, Sang Phill Park, Anand Raghunathan, and Kaushik Roy. Impact: imprecise adders for low-power approximate computing. In *IEEE/ACM International Symposium on Low Power Electronics and Design*, pages 409–414. IEEE, 2011.
- [71] Sparsh Mittal. A survey of techniques for approximate computing. *ACM Computing Surveys (CSUR)*, 48(4):1–33, 2016.
- [72] Swagath Venkataramani, Srimat T Chakradhar, Kaushik Roy, and Anand Raghunathan. Approximate computing and the quest for computing efficiency. In *2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC)*, pages 1–6. IEEE, 2015.

## Bibliography

---

- [73] Bert Moons and Marian Verhelst. A 0.3–2.6 tops/w precision-scalable processor for real-time large-scale convnets. In *2016 IEEE Symposium on VLSI Circuits (VLSI-Circuits)*, pages 1–2. IEEE, 2016.
- [74] Paul N Whatmough, Sae Kyu Lee, Hyunkwang Lee, Saketh Rama, David Brooks, and Gu-Yeon Wei. 14.3 a 28nm soc with a 1.2 ghz 568nj/prediction sparse deep-neural-network engine with > 0.1 timing error rate tolerance for iot applications. In *2017 IEEE International Solid-State Circuits Conference (ISSCC)*, pages 242–243. IEEE, 2017.
- [75] Xiaoyu Sun, Shihui Yin, Xiaochen Peng, Rui Liu, Jae-sun Seo, and Shimeng Yu. Xnor-rram: A scalable and parallel resistive synaptic architecture for binary neural networks. In *2018 Design, Automation & Test in Europe Conference & Exhibition (DATE)*, pages 1423–1428. IEEE, 2018.
- [76] Nitin Rathi, Priyadarshini Panda, and Kaushik Roy. Stdp-based pruning of connections and weight quantization in spiking neural networks for energy-efficient recognition. *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, 38(4):668–677, 2018.
- [77] Sanchari Sen, Swagath Venkataramani, and Anand Raghunathan. Approximate computing for spiking neural networks. In *Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017*, pages 193–198. IEEE, 2017.
- [78] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. *The journal of machine learning research*, 15(1):1929–1958, 2014.
- [79] Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using dropconnect. In *International conference on machine learning*, pages 1058–1066, 2013.
- [80] Emre O Neftci, Bruno U Pedroni, Siddharth Joshi, Maruan Al-Shedivat, and Gert Cauwenberghs. Stochastic synapses enable efficient brain-inspired learning machines. *Frontiers in neuroscience*, 10:241, 2016.
- [81] Gopalakrishnan Srinivasan, Abhroneil Sengupta, and Kaushik Roy. Magnetic tunnel junction based long-term short-term stochastic synapse for a spiking neural network with on-chip stdp learning. *Scientific reports*, 6:29545, 2016.

- [82] Lars Buesing, Johannes Bill, Bernhard Nessler, and Wolfgang Maass. Neural dynamics as sampling: a model for stochastic computation in recurrent networks of spiking neurons. *PLoS Comput Biol*, 7(11):e1002211, 2011.
- [83] Guillaume Bellec, David Kappel, Wolfgang Maass, and Robert Legenstein. Deep rewiring: Training very sparse deep networks. *arXiv preprint arXiv:1711.05136*, 2017.
- [84] Gregory K Chen, Raghavan Kumar, H Ekin Sumbul, Phil C Knag, and Ram K Krishnamurthy. A 4096-neuron 1m-synapse 3.8-pj/sop spiking neural network with on-chip stdp learning and sparse weights in 10-nm finfet cmos. *IEEE Journal of Solid-State Circuits*, 54(4):992–1002, 2018.
- [85] Sadique Sheik, Somnath Paul, Charles Augustine, Chinnikrishna Kothapalli, Muhammad M Khellah, Gert Cauwenberghs, and Emre Neftci. Synaptic sampling in hardware spiking neural networks. In *2016 IEEE International Symposium on Circuits and Systems (ISCAS)*, pages 2090–2093. IEEE, 2016.
- [86] M Jerry, A Parihar, B Grisafe, A Raychowdhury, and S Datta. Ultra-low power probabilistic imt neurons for stochastic sampling machines. In *2017 Symposium on VLSI Circuits*, pages T186–T187. IEEE, 2017.
- [87] Yongtae Kim, Yong Zhang, and Peng Li. An energy efficient approximate adder with carry skip for error resilient neuromorphic vlsi systems. In *2013 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)*, pages 130–137. IEEE, 2013.
- [88] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In *Proceedings of the IEEE conference on computer vision and pattern recognition*, pages 770–778, 2016.
- [89] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. *International journal of computer vision*, 115(3):211–252, 2015.
- [90] UG585 Xilinx. Zynq-7000 all programmable soc: Technical reference manual, 2015.
- [91] James Hrica. Floating-point design with vivado hls. *Xilinx Application Note*, 2012.
- [92] Michal Lom, Ondrej Pribyl, and Miroslav Svitek. Industry 4.0 as a part of smart cities. In *2016 Smart Cities Symposium Prague (SCSP)*, pages 1–6. IEEE, 2016.

## Bibliography

---

- [93] Turker Ince, Serkan Kiranyaz, Levent Eren, Murat Askar, and Moncef Gabbouj. Real-time motor fault detection by 1-d convolutional neural networks. *IEEE Transactions on Industrial Electronics*, 63(11):7067–7075, 2016.
- [94] Olivier Janssens, Viktor Slavkovikj, Bram Vervisch, Kurt Stockman, Mia Loccufier, Steven Verstockt, Rik Van de Walle, and Sofie Van Hoecke. Convolutional neural network based fault detection for rotating machinery. *Journal of Sound and Vibration*, 377:331–345, 2016.
- [95] Osama Abdeljaber, Onur Avci, Serkan Kiranyaz, Moncef Gabbouj, and Daniel J Inman. Real-time vibration-based structural damage detection using one-dimensional convolutional neural networks. *Journal of Sound and Vibration*, 388:154–170, 2017.
- [96] Xiaojie Guo, Liang Chen, and Changqing Shen. Hierarchical adaptive deep convolution neural network and its application to bearing fault diagnosis. *Measurement*, 93:490–502, 2016.
- [97] Eriko Nurvitadhi, Ganesh Venkatesh, Jaewoong Sim, Debbie Marr, Randy Huang, Jason Ong Gee Hock, Yeong Tat Liew, Krishnan Srivatsan, Duncan Moss, Suchit Subhaschandra, et al. Can fpgas beat gpus in accelerating next-generation deep neural networks? In *Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*, pages 5–14, 2017.
- [98] Chen Wu, Mingyu Wang, Xinyuan Chu, Kun Wang, and Lei He. Low-precision floating-point arithmetic for high-performance fpga-based cnn acceleration. *ACM Transactions on Reconfigurable Technology and Systems (TRETS)*, 15(1):1–21, 2021.
- [99] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. *Advances in neural information processing systems*, 28, 2015.
- [100] Chunsheng Mei, Zhenyu Liu, Yue Niu, Xiangyang Ji, Wei Zhou, and Dongsheng Wang. A 200mhz 202.4 gflops@ 10.8 w vgg16 accelerator in xilinx vx690t. In *2017 IEEE Global Conference on Signal and Information Processing (GlobalSIP)*, pages 784–788. IEEE, 2017.
- [101] Xiaocong Lian, Zhenyu Liu, Zhourui Song, Jiwu Dai, Wei Zhou, and Xiangyang Ji. High-performance fpga-based cnn accelerator with block-floating-point arithmetic. *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, 27(8):1874–1885, 2019.

- [102] Liangzhen Lai, Naveen Suda, and Vikas Chandra. Deep convolutional neural network inference with floating-point weights and fixed-point activations. *arXiv preprint arXiv:1703.03073*, 2017.
- [103] Sean O Settle, Manasa Bollavaram, Paolo D’Alberto, Elliott Delaye, Oscar Fernandez, Nicholas Fraser, Aaron Ng, Ashish Sirasao, and Michael Wu. Quantizing convolutional neural networks for low-power high-throughput inference engines. *arXiv preprint arXiv:1805.07941*, 2018.
- [104] Paolo Meloni, Antonio Garufi, Gianfranco Deriu, Marco Carreras, and Daniela Loi. Cnn hardware acceleration on a low-power and low-cost apsoc. In *2019 Conference on Design and Architectures for Signal and Image Processing (DASIP)*, pages 7–12. IEEE, 2019.
- [105] Chang Gao, Antonio Rios-Navarro, Xi Chen, Shih-Chii Liu, and Tobi Delbruck. Edgedrnn: Recurrent neural network accelerator for edge inference. *IEEE Journal on Emerging and Selected Topics in Circuits and Systems*, 10(4):419–432, 2020.
- [106] Shirsendu Sikdar, Sauvik Banerjee, and G. Ashish. Ultrasonic guided wave propagation and disbond identification in a honeycomb composite sandwich structure using bonded piezoelectric wafer transducers. *Journal of Intelligent Material Systems and Structures*, 27, 10 2015.
- [107] U. Kiencke, M. Schwarz, and T. Weickert. *Signalverarbeitung: Zeit-Frequenz-Analysen und Schätzverfahren*. Oldenbourh, 2008.
- [108] R. B. Blackman and J. W. Tukey. The measurement of power spectra from the point of view of communications engineering - part i. *Bell System Technical Journal*, 37(1):185–282, 1958.
- [109] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. *arXiv preprint arXiv:1412.6980*, 2014.
- [110] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. *arXiv preprint arXiv:1412.6806*, 2014.