



IEEE Custom Integrated Circuits Conference

## 9-2: An Energy-Efficient and Runtime-Reconfigurable FPGA-Based Accelerator for Robotic Localization Systems

*Qiang Liu<sup>1,\*</sup>, Zishen Wan<sup>2,\*</sup>, Bo Yu<sup>3,\*</sup>, Weizhuang Liu<sup>1</sup>, Shaoshan Liu<sup>3</sup>, Arijit Raychowdhury<sup>2</sup>*

*\* Equally Contributed Authors*

<sup>1</sup> *Tianjin University, China*

<sup>2</sup> *Georgia Institute of Technology, USA*

<sup>3</sup> *PerceptIn, USA*

April 25, 2022



# Bio



Email: [zishenwan@gatech.edu](mailto:zishenwan@gatech.edu)

Homepage: <https://zishenwan.github.io>

- **Speaker: Zishen Wan**

- PhD student at Georgia Tech (Fall 2020-present)
  - Advisor: Prof. Arijit Raychowdhury
- MS, Harvard University
  - Advisor: Prof. Vijay Janapa Reddi
- BS, Harbin Institute of Technology

- **Research Interests**

- VLSI, computer architecture, edge computing.
- Efficient and resilient hardware and system design for autonomous machines.

# Motivation: Autonomous Systems

Drones



Self-Driving Cars



Robots



## Applications

Search & Rescue



Package Delivery



Surveillance



# How Does an Autonomous System Work?



# Challenges

## Large Factor Graph:



**4000+ factors**

## Real-Time Requirement:



## Low Power Budget:



**Big battery**

CPU/GPU: 10-100 W

## Dynamically Changing Environments:



**Sparse**



**Dense**

# Energy-Efficient Localization and Mapping



FPGA Zynq-7000 SoC ZC706  
with XC7Z045 FFG900-2

- Energy-efficient & real-time localization and mapping
- Dynamic reconfiguration at runtime
- Real-time performance of 61 fps at 3.45 W (56 mJ/frame)

# Outline

- SLAM: Simultaneous Localization and Mapping
- Hardware Architecture
- Main Contributions
- Evaluations and Comparisons
- Summary

# Localization and Mapping Using SLAM

Camera



IMU  
(Inertial Measurement Unit)



**SLAM is computationally intensive:**

**ORB-SLAM**

FrontEnd: ORB  
BackEnd: SLAM



**LK-SLAM**

FrontEnd: LK  
BackEnd: SLAM



Localization  
(2D poses)

Mapping  
(3D coordinates)

# How Does SLAM Work?



# Hardware Architecture – Overview



# Hardware Architecture – Perception



**Sensor Input:**  
Camera + IMU, processed on the host

# Hardware Architecture – SLAM (NLS Optimization)



**SLAM Nonlinear Least Squares (NLS) Optimization:**  
Jacobian, Schur elimination, Cholesky decomposition, etc.
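The optimization step named on this slide can be sketched in a few lines. This is a minimal illustration, not the accelerator's datapath: it assumes a small dense Jacobian `J` and residual vector `r`, uses plain identity damping (`lam * I`), and skips Schur elimination for brevity. The function name `lm_step` and the toy data are hypothetical.

```python
import numpy as np

def lm_step(J, r, lam):
    """One damped Gauss-Newton (Levenberg-Marquardt style) update.

    Solves (J^T J + lam * I) delta = -J^T r using a Cholesky factorization.
    """
    H = J.T @ J                        # normal-equation matrix (approximate Hessian)
    g = J.T @ r                        # gradient of 0.5 * ||r||^2
    A = H + lam * np.eye(H.shape[0])   # lam > 0 makes A positive definite
    L = np.linalg.cholesky(A)          # A = L L^T with L lower triangular
    y = np.linalg.solve(L, -g)         # solve L y = -g
    return np.linalg.solve(L.T, y)     # solve L^T delta = y

# Toy linear residual r(x) = A x - b, whose Jacobian is simply A (hypothetical data).
rng = np.random.default_rng(0)
A_toy = rng.normal(size=(6, 3))
b_toy = rng.normal(size=6)
x0 = np.zeros(3)
delta = lm_step(A_toy, A_toy @ x0 - b_toy, lam=0.0)
x1 = x0 + delta                        # undamped step lands on the least-squares solution
```

For a linear residual, a single undamped step already reaches the least-squares solution; a nonlinear SLAM problem would repeat the step while adapting `lam`.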

# Hardware Architecture – SLAM Marginalization



**SLAM Marginalization:**  
Jacobian, Schur elimination, Cholesky decomposition, etc.

# **Method 1**

Data Reuse

# Data Reuse & Design Hierarchy



2 Keyframes  
3 Feature Points (F1~F3)  
4 Observations (O1~O4)

→

|    | O1 | O2 | O3 | O4 |
|----|----|----|----|----|
| F1 | ■  |    |    |    |
| F2 |    |    | ■  |    |
| F3 |    | ■  |    | ■  |

Jacobian Matrix

\<feature point, observation\>  
pairs have non-zero values

# Data Reuse & Design Hierarchy



2 Keyframes  
3 Feature Points (F1~F3)  
4 Observations (O1~O4)



## Two-Level Data Reuses:

- Feature-reuse: across associated observations  
  → feature (row)-stationary
- Keyframe-reuse: over all observations within a keyframe

## Three-Level Block Designs:

- Keyframe-level: Rotation matrix of keyframes
- Feature-level: 3D coordinates
- Observation-level: Jacobian matrix
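The feature-stationary idea can be illustrated with the toy example from the slides (3 feature points, 4 observations). The sketch below is an assumption-laden model rather than the hardware schedule: the feature/observation pairs follow the Jacobian sparsity table shown earlier in this section, while the keyframe assignment (`K1`, `K2`) and all function names are hypothetical.

```python
from collections import defaultdict

# (observation, feature, keyframe). Feature pairing follows the Jacobian table;
# the keyframe assignment is a hypothetical choice for illustration only.
observations = [
    ("O1", "F1", "K1"),
    ("O2", "F3", "K1"),
    ("O3", "F2", "K2"),
    ("O4", "F3", "K2"),
]

def sparsity_pattern(obs):
    """Row per feature, column per observation; 1 marks a non-zero Jacobian block."""
    features = sorted({f for _, f, _ in obs})
    return {f: [int(f == fo) for _, fo, _ in obs] for f in features}

def feature_stationary_schedule(obs):
    """Group observations by feature so each feature's 3D point is loaded once."""
    by_feature = defaultdict(list)
    for o, f, _ in obs:
        by_feature[f].append(o)
    return dict(sorted(by_feature.items()))

pattern = sparsity_pattern(observations)
schedule = feature_stationary_schedule(observations)

naive_feature_loads = len(observations)   # reload the feature for every observation
reuse_feature_loads = len(schedule)       # load each feature once, reuse across its observations
```

In this toy case the feature data is fetched 3 times instead of 4; the saving grows with the number of observations per feature.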

# **Method 2**

Symmetry & Sparsity

# Diagonal Computation + Symmetry + Hardware Reuse

Schur Elimination:



Make U a diagonal matrix:  
$O(n^3) \rightarrow O(n)$ computational complexity

X becomes the transpose of W:  
1.34x on-chip memory reduction
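A small sketch shows why a diagonal U is cheap to eliminate. The block layout is an assumption, since the slide figure is not available in the text: the linear system is taken to be `[[U, W], [X, V]]` with `X = W^T`, and the diagonal block `U` is the one eliminated. All function and variable names are hypothetical.

```python
import numpy as np

def schur_eliminate_diag(u_diag, W, V, b_u, b_v):
    """Eliminate the diagonal block U from [[U, W], [W^T, V]] [x_u; x_v] = [b_u; b_v]."""
    inv_u = 1.0 / u_diag          # inverting a diagonal matrix costs n reciprocals
    Wt_inv_u = W.T * inv_u        # W^T U^{-1}: column j of W^T scaled by 1 / u_j
    S = V - Wt_inv_u @ W          # reduced (Schur complement) matrix
    r = b_v - Wt_inv_u @ b_u      # reduced right-hand side
    return S, r

def back_substitute(u_diag, W, b_u, x_v):
    """Recover the eliminated unknowns once x_v is known."""
    return (b_u - W @ x_v) / u_diag

# Hypothetical toy data; only W is stored because X is its transpose.
rng = np.random.default_rng(1)
n, m = 5, 3
u_diag = rng.uniform(1.0, 2.0, size=n)
W = rng.normal(size=(n, m))
V = W.T @ (W / u_diag[:, None]) + np.eye(m)   # chosen so the reduced system is well conditioned
b_u, b_v = rng.normal(size=n), rng.normal(size=m)

S, r = schur_eliminate_diag(u_diag, W, V, b_u, b_v)
x_v = np.linalg.solve(S, r)
x_u = back_substitute(u_diag, W, b_u, x_v)
```

Because `X` is never materialized, only `W` needs to be kept on chip, which is the kind of saving the symmetry bullet refers to.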

# Diagonal Computation + Symmetry + Hardware Reuse

Marginalization:

Make M a diagonal matrix:  
$O(n^3) \rightarrow O(n)$ computational complexity

Reuse the Schur elimination circuit in marginalization:  
Reduces resource consumption without performance degradation



# Data Layout + Symmetry + Sparsity + Co-observation

**4.1x** memory reduction

Exploiting data characteristics unique to SLAM

S matrix: stores the parameters  
of the linear system  
(40%-80% of total storage)



720 kb → **4.1x reduction** → 175.97 kb
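The symmetry part of this saving can be illustrated with packed storage of a symmetric matrix. This sketch covers symmetry only; the sparsity and co-observation optimizations are not modeled, so it does not reproduce the 4.1x figure. The layout (row-major lower triangle) and all names are assumptions for illustration.

```python
import numpy as np

def pack_lower(S):
    """Keep only the lower triangle (including the diagonal) in row-major order."""
    return S[np.tril_indices(S.shape[0])]

def packed_index(i, j):
    """Map a full-matrix index (i, j) to its position in the packed array."""
    if i < j:
        i, j = j, i               # symmetry: S[i, j] == S[j, i]
    return i * (i + 1) // 2 + j

def packed_get(packed, i, j):
    return packed[packed_index(i, j)]

# Hypothetical 6x6 symmetric matrix.
n = 6
rng = np.random.default_rng(2)
B = rng.normal(size=(n, n))
S_full = B + B.T
S_packed = pack_lower(S_full)     # n * (n + 1) / 2 = 21 values instead of 36
```

For large `n` the packed form approaches half the original storage, before any sparsity is exploited.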

# **Method 3**

Time-Multiplex & Pipeline

# Time-Multiplexed + Pipeline Processing

Cholesky decomposition: $S = LL^T$ ($S$: symmetric matrix; $L$: lower triangular matrix)
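As a reference for the factorization itself, here is a minimal column-by-column sketch in software. It is not the accelerator's schedule; it only makes visible that the below-diagonal entries of one column do not depend on each other, which is the kind of structure a time-multiplexed, pipelined unit could plausibly exploit. The function name `cholesky_lower` is hypothetical.

```python
import numpy as np

def cholesky_lower(S):
    """Return lower-triangular L with S = L @ L.T for a symmetric positive definite S."""
    n = S.shape[0]
    L = np.zeros_like(S, dtype=float)
    for j in range(n):
        # Diagonal entry of column j depends on the already computed row j of L.
        d = S[j, j] - np.dot(L[j, :j], L[j, :j])
        L[j, j] = np.sqrt(d)
        # Entries below the diagonal in column j are mutually independent.
        for i in range(j + 1, n):
            L[i, j] = (S[i, j] - np.dot(L[i, :j], L[j, :j])) / L[j, j]
    return L

# Hypothetical symmetric positive definite test matrix.
rng = np.random.default_rng(3)
M = rng.normal(size=(5, 5))
S_spd = M @ M.T + 5.0 * np.eye(5)
L_fact = cholesky_lower(S_spd)
```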



# **Method 4**

**Runtime Reconfiguration & Clock Gating**

# Runtime Reconfiguration + Clock Gating

## Runtime Reconfiguration + Clock Gating:

**1.47x** power reduction on the KITTI dataset

**5.75x** power reduction on the EuRoC dataset

**<0.01 cm** accuracy degradation



# Evaluation – Dataset

- EuRoC Dataset (for drone)
  - A challenging and widely used UAV dataset
  - 11 sequences in three categories: easy, medium, and difficult
  - This work: Machine Hall sequences
- KITTI Dataset (for self-driving car)
  - A widely used autonomous driving vision benchmark
  - Tasks of interest: stereo, optical flow, visual odometry, 3D object detection, and 3D tracking
  - This work: odometry (grayscale sequence)



# Evaluation – FPGA Platform



FPGA Zynq-7000 SoC ZC706  
with XC7Z045 FFG900-2

| Resource            | Value (utilization) |
|---------------------|---------------------|
| Operating Frequency | 143 MHz             |
| LUT                 | 144108 (65.92%)     |
| Flip-Flop           | 172935 (39.56%)     |
| BRAM                | 268 (49.17%)        |
| DSP                 | 869 (96.56%)        |

# Evaluation

## Processing Latency and Energy of FPGA, CPU, and GPU



- FPGA: Xilinx Zynq-7000 SoC ZC706 @ 143 MHz
- CPU: Intel Comet Lake processor, 12 cores @ 2.9 GHz
- TX1: quad-core Arm Cortex-A57 processor @ 1.9 GHz

| EuRoC Dataset (for drone)            | Speedup over CPU | Speedup over TX1 | Energy reduction over CPU | Energy reduction over TX1 |
|--------------------------------------|------------------|------------------|---------------------------|---------------------------|
| FPGA ZC706                           | 8.73x            | 70.10x           | 164.40x                   | 40.84x                    |
| Kintex-7 Series (XC7K160tffg484)     | 7.01x            | 56.30x           | 180.73x                   | 44.90x                    |
| Virtex-7 Series (XC7VX690tffg1761)   | 10.75x           | 86.34x           | 172.05x                   | 42.75x                    |

| KITTI Dataset (for car)              | Speedup over CPU | Speedup over TX1 | Energy reduction over CPU | Energy reduction over TX1 |
|--------------------------------------|------------------|------------------|---------------------------|---------------------------|
| FPGA ZC706                           | 10.49x           | 45.48x           | 182.88x                   | 24.51x                    |
| Kintex-7 Series (XC7K160tffg484)     | 8.27x            | 35.82x           | 196.09x                   | 26.28x                    |
| Virtex-7 Series (XC7VX690tffg1761)   | 12.71x           | 55.08x           | 188.60x                   | 25.28x                    |

# Evaluation

## Comparison with Related Work

|                      | This work                                  | ISSCC'19 CNN-SLAM [1]                      | JSSC'19 Navion [2]                  | TC'20 pi-BA [3]                            | RSS'17 VIO on Chip [4]              | HPCA'21 Eudoxus [5]          |
|----------------------|--------------------------------------------|--------------------------------------------|-------------------------------------|--------------------------------------------|-------------------------------------|------------------------------|
| Platform             | FPGA                                       | ASIC                                       | ASIC                                | FPGA                                       | FPGA                                | FPGA                         |
| Technology           | 28 nm                                      | 28 nm                                      | 65 nm                               | 28 nm                                      | 28 nm                               | 16 nm                        |
| Design               | digital                                    | digital                                    | digital                             | digital                                    | digital                             | digital                      |
| Type                 | SLAM                                       | SLAM                                       | SLAM                                | SLAM                                       | SLAM                                | SLAM                         |
| Algorithm            | Levenberg-Marquardt (optimization-based)   | Levenberg-Marquardt (optimization-based)   | Gauss-Newton (optimization-based)   | Levenberg-Marquardt (optimization-based)   | Gauss-Newton (optimization-based)   | Kalman filter (filter-based) |
| DoF                  | 6-DoF                                      | 6-DoF                                      | 6-DoF                               | 6-DoF                                      | 6-DoF                               | 6-DoF                        |
| Voltage              | 1 V                                        | 0.63-0.9 V                                 | 1.2 V                               | 1 V                                        | 1 V                                 | 0.85 V                       |
| Power                | 3.45 W                                     | 243.6 mW @ 0.9 V<br>61.75 mW @ 0.63 V      | 24 mW                               | 5.50 W                                     | 1.46 W                              | 8.96 W                       |
| Frequency            | 143 MHz                                    | 240 MHz                                    | 62.5/83.3 MHz                       | 143 MHz                                    | 100 MHz                             | 180 MHz                      |
| Throughput           | 55.8 GOPS                                  | 879.6 GOPS @ 0.9 V<br>329.8 GOPS @ 0.63 V  | 10.5-59.1 GOPS                      | N/A                                        | 4.4-24.6 GOPS                       | N/A                          |
| Latency              | 16.43 ms                                   | N/A                                        | 30.8 ms                             | 110 ms                                     | 200 ms                              | 44.6 ms                      |
| Energy per Frame     | 56.6 mJ                                    | N/A                                        | 739.2 µJ                            | 605 mJ                                     | 292 mJ                              | 399.6 mJ                     |
| Dynamic Optimization | Yes                                        | N/A                                        | N/A                                 | No                                         | No                                  | No                           |
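The energy-per-frame row appears to follow from power multiplied by latency, and the 61 fps figure from the reciprocal of the latency. The short check below uses only the power and latency values listed in the table; the helper names are hypothetical.

```python
def energy_per_frame_mj(power_w, latency_ms):
    """Energy per frame in millijoules: watts times milliseconds."""
    return power_w * latency_ms

def frames_per_second(latency_ms):
    """Frame rate if frames are processed back to back at the given latency."""
    return 1000.0 / latency_ms

# (power in W, latency in ms), taken from the comparison table.
designs = {
    "This work": (3.45, 16.43),
    "Navion [2]": (0.024, 30.8),
    "pi-BA [3]": (5.50, 110.0),
    "VIO on Chip [4]": (1.46, 200.0),
    "Eudoxus [5]": (8.96, 44.6),
}

energies = {name: energy_per_frame_mj(p, t) for name, (p, t) in designs.items()}
fps_this_work = frames_per_second(designs["This work"][1])
```

For this work the product is about 56.7 mJ, which matches the listed 56.6 mJ up to rounding, and the frame rate comes out at roughly 61 fps.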

# Summary

- **Energy-efficient** and **runtime-reconfigurable** FPGA accelerator for robotic localization and mapping.
- Leverages data sparsity, locality, and parallelism inherent in localization.
  - **4.1x** memory reduction with symmetry and sparsity
  - **5.7x** compute time reduction with time-multiplexed and pipeline processing
  - **5.8x** power reduction with runtime reconfiguration and clock gating
- Our design is up to **2 orders of magnitude** more energy efficient than CPU and GPU baselines (164x-196x energy reduction over the CPU, 24x-45x over the TX1).

# Reference



[Wan, Synthesis Lectures on Comp Arch 2021]



[Wan, CICC 2022]



[Wan, Circuits and Systems Magazine 2021]

**THANK YOU**
