



<http://synergy.ece.gatech.edu>



# Enabling Continuous Learning through Synaptic Plasticity in Hardware

Tushar Krishna

Georgia Tech

EMC<sup>2</sup> Workshop  
June 23 2019

# The Dream!



# Deep Learning Applications

“AI is the new electricity” – Andrew Ng

Object Detection



Image Segmentation



Medical Imaging



Speech Recognition



Text to Speech



Recommendations



Games



# Deep Learning Landscape



# Computation Platforms



# Efficiency of Deep Learning Systems



# What is Continuous Learning?



Become better and faster  
with experience

Learn new tasks



Compute and  
energy-efficiency

Can we leverage  
Supervised Deep  
Learning?

# Deep Learning Landscape



# Efficiency of Continuous Learning Systems



# Outline of Talk



GeneSys

MAERI

# Outline of Talk

Ananda Samajdar, Parth Mannan, Kartikay Garg, and Tushar Krishna, *GeneSys: Enabling Continuous Learning through Neural Network Evolution in Hardware*, MICRO 2018



- Continuous Learning Template

- Neuro-Evolutionary Algorithms
  - Algorithm Description
  - Characterizing NEAT
- Microarchitecture
- Evaluations

# Continuous Learning in Brains



Constant synapse formation and pruning

# Template for Continuous Learning



# Conventional RL: Challenges

Deep NNs used internally

- Manual hyperparameter tuning

Each update results in **Backpropagation**

- High compute requirement at every update
- High memory overhead
- Not scalable

**Not viable for continuous  
learning on the edge**

# Outline of Talk



- Continuous Learning Template
- Neuro-Evolutionary Algorithms
  - Algorithm Description
  - Characterizing NEAT
- Microarchitecture
- Evaluations

# Neuro-Evolutionary (NE) Algorithm



Neural Network (NN) expressed as a graph

**NeuroEvolution of Augmenting Topologies (NEAT) [1]**

**Gene:** Vertex or Edge  
in the graph

**Genome:** Collection of all  
genes (i.e., a NN)

[1] Stanley, K. O., & Miikkulainen, R. (2002). Evolving neural networks through augmenting topologies. *Evolutionary computation*, 10(2), 99-127.
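The gene and genome vocabulary above can be sketched as a small data structure. This is an illustrative sketch only: the class names (`NodeGene`, `ConnectionGene`, `Genome`) and their fields are assumptions chosen for readability, not the representation used in the NEAT codebase or in GeneSys.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class NodeGene:
    """A vertex of the network graph (hypothetical layout)."""
    node_id: int
    kind: str  # "input", "hidden", or "output"


@dataclass(frozen=True)
class ConnectionGene:
    """An edge of the network graph (hypothetical layout)."""
    src: int
    dst: int
    weight: float
    enabled: bool = True


@dataclass
class Genome:
    """Collection of all genes, i.e. one neural network."""
    nodes: dict = field(default_factory=dict)        # node_id -> NodeGene
    connections: dict = field(default_factory=dict)  # (src, dst) -> ConnectionGene

    def add_node(self, node_id, kind):
        self.nodes[node_id] = NodeGene(node_id, kind)

    def add_connection(self, src, dst, weight):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must exist before connecting them")
        self.connections[(src, dst)] = ConnectionGene(src, dst, weight)

    def num_genes(self):
        return len(self.nodes) + len(self.connections)


# A two-input, one-output network expressed as a genome.
genome = Genome()
genome.add_node(0, "input")
genome.add_node(1, "input")
genome.add_node(2, "output")
genome.add_connection(0, 2, 0.5)
genome.add_connection(1, 2, -1.25)
```

Storing vertices and edges as separate gene types keeps each gene small and independent, which is what later makes gene-level operations easy to parallelize.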

# Properties of NE algorithms

## Algorithmic

Robustness



No Training



Change fitness function



Accuracy?

## Systems

Too much compute!

Convergence time?

déjà vu! Looks like Deep Neural Networks in the 90s



Eyeriss



GPU



FPGA

HW solutions enabled  
Deep Learning

*Can we do the same with EA?*

# Outline of Talk



DNN Architecture

How to autonomously design DNN models for continuous learning?

**GeneSys**

- Continuous Learning
- Neuro-Evolutionary Algorithms
  - Algorithm Description
  - Characterizing NEAT
- Microarchitecture
- Evaluations

# Characterization of NEAT



Codebase



Environments



Ran each environment till convergence, multiple times

Only changed fitness function between workloads
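The point that only the fitness function changes between workloads can be illustrated with a toy evolutionary loop. This is a minimal sketch, not NEAT: it uses fixed-length lists of floats as genomes, and the function names (`evolve`, `maximize_sum`, `reach_target`) and all parameter values are assumptions made for illustration.

```python
import random


def evolve(fitness_fn, genome_len=4, pop_size=20, generations=30, seed=0):
    """Generic evolutionary loop; the workload enters only through fitness_fn."""
    rng = random.Random(seed)
    population = [[rng.uniform(-1, 1) for _ in range(genome_len)]
                  for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        ranked = sorted(population, key=fitness_fn, reverse=True)
        history.append(fitness_fn(ranked[0]))
        parents = ranked[: pop_size // 4]
        children = [ranked[0][:]]  # elitism: keep the fittest genome unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # crossover
            i = rng.randrange(genome_len)
            child[i] += rng.gauss(0, 0.1)                     # mutation
            children.append(child)
        population = children
    return max(population, key=fitness_fn), history


# Two hypothetical workloads: same loop, different fitness function.
def maximize_sum(genome):
    return sum(genome)


def reach_target(genome):
    return -sum((g - 0.5) ** 2 for g in genome)


best_sum, hist_sum = evolve(maximize_sum)
best_tgt, hist_tgt = evolve(reach_target)
```

Because the fittest genome is carried over unchanged, the best fitness per generation never decreases in this sketch.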

# Characterization of NEAT

## Computations



**Inference:**  
Population level parallelism (PLP)

**Evolution:**  
Gene level parallelism (GLP)

## Distribution of Operations/Generation



All operations are independent



Large operation level Parallelism
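Population-level parallelism follows from the fact that each genome is evaluated independently. The sketch below shows that independence with a thread pool; the `evaluate` function is a stand-in fitness and the population values are made up for illustration. In CPython this demonstrates the absence of dependencies rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate(genome):
    """Stand-in fitness: depends only on this genome, on no shared state."""
    return -sum(w * w for w in genome)


# Hypothetical population of eight small genomes.
population = [[0.1 * i, -0.2 * i, 0.3] for i in range(8)]

# Serial evaluation, one genome after another.
serial = [evaluate(g) for g in population]

# Parallel evaluation: genomes can be dispatched in any order.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(evaluate, population))
```

Since no genome reads another genome's result, the two evaluation orders give identical fitness values.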

# Operations in NEAT

## Crossover



## Evolution

## Mutation



## Addition mutation

- Add new node
- Add new connection
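The evolution operations listed above can be sketched on a genome stored as a dictionary from `(src, dst)` edges to weights. This is an illustrative sketch under simplifying assumptions: the function names are hypothetical, node genes are omitted, and `add_node` deletes the split connection, whereas NEAT as published marks it disabled.

```python
import random


def crossover(fitter, other, rng):
    """Child inherits the fitter parent's genes; shared genes come from either parent."""
    child = {}
    for key, weight in fitter.items():
        if key in other and rng.random() < 0.5:
            child[key] = other[key]
        else:
            child[key] = weight
    return child


def perturb(conns, rng, sigma=0.1):
    """Mutation: add Gaussian noise to every connection weight."""
    return {k: w + rng.gauss(0, sigma) for k, w in conns.items()}


def add_connection(conns, src, dst, rng):
    """Addition mutation: new edge with a random weight (no-op if it already exists)."""
    new = dict(conns)
    new.setdefault((src, dst), rng.uniform(-1, 1))
    return new


def add_node(conns, edge, new_node):
    """Addition mutation: split an existing edge with a new node."""
    src, dst = edge
    new = dict(conns)
    old_weight = new.pop(edge)
    new[(src, new_node)] = 1.0
    new[(new_node, dst)] = old_weight
    return new


rng = random.Random(0)
parent_a = {(0, 2): 0.5, (1, 2): -1.0, (0, 1): 0.25}  # assumed fitter parent
parent_b = {(0, 2): 0.9, (1, 2): 0.3}
child = crossover(parent_a, parent_b, rng)
grown = add_node(child, (0, 2), new_node=3)
```

Each operation touches one gene at a time and needs no information from other genes, which is the property gene-level parallelism relies on.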

## Inference



## MAC



## Activation

Simple operations
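Inference on an evolved network reduces to multiply-accumulate followed by an activation at each node. The sketch below walks a small irregular graph in a given topological order; the sigmoid activation, the `infer` function, and the example weights are assumptions for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def infer(connections, inputs, order):
    """Evaluate nodes in topological order: MAC over incoming edges, then activation."""
    values = dict(inputs)
    for node in order:
        acc = 0.0
        for (src, dst), weight in connections.items():
            if dst == node:
                acc += weight * values[src]  # multiply-accumulate
        values[node] = sigmoid(acc)          # activation
    return values


# Irregular topology: node 3 reads both hidden node 2 and input node 0.
connections = {(0, 2): 1.0, (1, 2): -1.0, (2, 3): 2.0, (0, 3): 0.5}
out = infer(connections, {0: 1.0, 1: 1.0}, order=[2, 3])
```

Here node 2 accumulates `1.0 - 1.0 = 0.0`, and node 3 accumulates `2.0 * 0.5 + 0.5 * 1.0 = 1.5` before the activation is applied.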

# Characterization of NEAT

## Memory

### Opportunity for Reuse



Fittest parent genome is used roughly 10 to 20 times each generation

Even higher in certain cases

### Distribution of Memory footprint/Generation



125KB

<1MB

Entire population can fit on-chip

Only need to store the weights and node info

# Properties of NE algorithms

## Algorithmic

Robustness

No Training



Change fitness function



## Systems

Massive Parallelism

Low Memory Footprint

Genomes within Population

Only store genomes in current generation

Genes within a Genome

No backprop

Simple HW-friendly Ops

No gradient calculations or storage

MACs in Inference  
Crossover and Mutation in Evolution

**HW-SW Co-Design of NE makes them viable for continuous learning on edge**

# Motivating Hardware Solution



# Outline of Talk



How to autonomously update DNN models for continuous learning?

**GeneSys**

- Continuous Learning
- Neuro-Evolutionary Algorithms
  - Algorithm Description
  - Characterizing NEAT
- Microarchitecture
- Evaluations

# GeneSys SoC



# Evolution Engine: EvE Microarchitecture



# PE Microarchitecture



**Evolution Engine (EvE)**

**Genome:** Neural Network  
**Gene:** Node or Connection  
**Population Size = n**



**Details of pipeline stages  
in the paper**

# Inference Engine: ADAM Microarchitecture



Conventional DNN  
Inference Accelerator

Exploit Population Level  
Parallelism

Networks generated by  
NEAT are irregular (thus  
sparse)

Details later in  
talk!

# Outline of Talk



How to autonomously update DNN models for continuous learning?

**GeneSys**

- Continuous Learning
- Neuro-Evolutionary Algorithms
  - Algorithm Description
  - Characterizing NEAT
- Microarchitecture
- Evaluations

# Implementation

## GeneSys Parameters

| Parameter           | Value               |
|---------------------|---------------------|
| <b>Tech node</b>    | 15 nm               |
| <b>Num EvE PE</b>   | 256                 |
| <b>Num ADAM PE</b>  | 1024                |
| <b>EvE Area</b>     | 0.89 mm<sup>2</sup> |
| <b>ADAM Area</b>    | 0.25 mm<sup>2</sup> |
| <b>GeneSys Area</b> | 2.45 mm<sup>2</sup> |
| <b>Power</b>        | 947.5 mW            |
| <b>Frequency</b>    | 200 MHz             |
| <b>Voltage</b>      | 1.0 V               |
| <b>SRAM banks</b>   | 48                  |
| <b>SRAM depth</b>   | 4096                |



# Evaluations

| Legend  | Inference | Evolution | Platform        |
|---------|-----------|-----------|-----------------|
| CPU_a   | Serial    | Serial    | 6th gen i7      |
| CPU_b   | PLP       | Serial    | 6th gen i7      |
| GPU_a   | BSP       | PLP       | Nvidia GTX 1080 |
| GPU_b   | BSP + PLP | PLP       | Nvidia GTX 1080 |
| CPU_c   | Serial    | Serial    | ARM Cortex A57  |
| CPU_d   | PLP       | Serial    | ARM Cortex A57  |
| GPU_c   | BSP       | PLP       | Nvidia Tegra    |
| GPU_d   | BSP + PLP | PLP       | Nvidia Tegra    |
| GENESYS | PLP       | PLP + GLP | GENESYS         |

PLP (GLP) - Population (Gene) Level Parallelism

BSP - Bulk Synchronous Parallelism (GPU)

# Evaluations: Energy



# Evaluations: Runtime



Faster convergence

# Summary for GeneSys

- Robust, scalable, and energy-efficient solutions are needed for continuous learning
  - Look beyond DL and RL
- NEs offer promise
  - Parallelism
  - Low memory footprint
  - HW-friendly
- GeneSys: *100x – 10000x energy efficiency and performance*
  - More deployable compute
  - Enables AI solutions for a large gamut of problems

# Outline of Talk

- Motivation
  - Irregular Dataflows
  - DNN Computation
- MAERI
  - Abstraction
  - Implementation
  - Operation Example
  - Mapping Strategies
- Evaluations

Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna  
**MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects**

**ASPLOS 2018, IEEE Micro Top Picks 2019 Honorable Mention**



# Myriad Dataflows in DNN Accelerators

- **DNN Topologies**
  - Layer size / shape
  - Layer types: Convolution / Pool / FC / LSTM
  - New sub-structure: e.g., Inception in GoogLeNet
- **Compiler/Mapper**
  - Loop Scheduling
    - Reordering and Tiling
  - Mapping
    - Output/Weight/Input/Row-stationary
- **Algorithmic Optimization (e.g., Sparsity)**
  - Weight pruning
  - GeneSys
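The loop scheduling and mapping choices listed above change the order of computation, not its result. The sketch below shows this on a 1-D convolution with two loop orders, one that finishes each output before moving on and one that holds each weight fixed while sweeping tiled outputs. The function names and the tile size are assumptions; real accelerator dataflows also involve spatial mapping that this sketch does not model.

```python
def conv1d_output_stationary(x, w):
    """Outer loop over outputs: each output is fully accumulated before the next."""
    n_out = len(x) - len(w) + 1
    out = [0.0] * n_out
    for o in range(n_out):
        for k in range(len(w)):
            out[o] += w[k] * x[o + k]
    return out


def conv1d_weight_stationary(x, w, tile=2):
    """Outer loop over weights: each weight is reused across tiled outputs."""
    n_out = len(x) - len(w) + 1
    out = [0.0] * n_out
    for k in range(len(w)):                    # weight held in place
        for o0 in range(0, n_out, tile):       # tiling of the output loop
            for o in range(o0, min(o0 + tile, n_out)):
                out[o] += w[k] * x[o + k]
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0]
w = [1.0, 0.0, -1.0]
os_result = conv1d_output_stationary(x, w)
ws_result = conv1d_weight_stationary(x, w)
```

Both orderings perform the same multiplications; what differs is which operand stays put and which one has to be moved, which is what drives the communication pattern an accelerator must support.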



Can we have one architectural solution that can handle arbitrary dataflows and provides ~100% utilization?

# What is the computation in a DNN?

## Independent multiplication



Our Key insight: Each DNN/dataflow translates into neurons of different sizes

# Irregular Dataflow: Pruning

## Example: Weight Pruning (Sparse Workload)



Our Key insight: Each DNN/dataflow translates into neurons of different sizes

# Outline of Talk

- Motivation
  - Irregular Dataflows
  - DNN Computation
- MAERI
  - Abstraction
  - Implementation
  - Operation Example
  - Mapping Strategies
- Evaluations



# The MAERI Abstraction



Multiplier Pool



VN0    VN1    VN2

Adder Pool



**Virtual Neuron (VN):** Temporary grouping of compute units for an output

*How to enable flexible grouping?*

**Need flexible connectivity!**
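The Virtual Neuron idea can be sketched as carving a pool of multipliers into groups of differing sizes. This is a simplified model: the `partition_into_vns` function is hypothetical, and giving each VN a contiguous range of multipliers is an assumption of the sketch rather than something stated on the slide.

```python
def partition_into_vns(num_multipliers, vn_sizes):
    """Group multipliers into Virtual Neurons of the requested sizes.

    Returns the list of groups (as index ranges) and the fraction of
    multipliers in use. VNs that no longer fit are left unmapped.
    """
    groups, start = [], 0
    for size in vn_sizes:
        if start + size > num_multipliers:
            break
        groups.append(range(start, start + size))
        start += size
    return groups, start / num_multipliers


# Three VNs of different sizes sharing a 16-multiplier pool.
groups, utilization = partition_into_vns(16, [5, 5, 6])

# Two 9-wide VNs do not both fit in the same pool.
groups_big, utilization_big = partition_into_vns(16, [9, 9])
```

With sizes 5, 5, and 6 every multiplier is busy; with two 9-wide VNs only one fits and 7 of the 16 multipliers sit idle, which is why matching group sizes to the workload matters.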

# Outline of Talk

- Motivation
  - Irregular Dataflows
  - DNN Computation
- MAERI
  - Abstraction
  - Implementation
  - Operation Example
  - Mapping Strategies
- Evaluations



Microarchitecture

How to design an efficient accelerator for changing DNN models?

**MAERI**

# The MAERI Implementation



# Orientation



# Outline of Talk

- Motivation
  - Irregular Dataflows
  - DNN Computation
- MAERI
  - Abstraction
  - Implementation
  - Operation Example
  - Mapping Strategies
- Evaluations



# Example: Computing a CONV layer

- **[Communication]** Distribute weights and inputs (image pixels) to multiplier switches
  - *Assume: weight stationary, conv reuse of inputs via local links*
- **[Computation]** Compute partial sums
- **[Computation]** Reduce partial sums
- **[Communication]** Collect outputs to buffer

# MAERI Operation Example

*Sparse Weight Filter*

|          |          |          |
|----------|----------|----------|
| $W_{00}$ | $W_{01}$ | $W_{02}$ |
| $W_{10}$ | $W_{11}$ | 0        |



**Filter**

**Slides**

|          |          |          |          |
|----------|----------|----------|----------|
| $X_{00}$ | $X_{01}$ | $X_{02}$ | $X_{03}$ |
| $X_{10}$ | $X_{11}$ | $X_{12}$ | $X_{13}$ |
| $X_{20}$ | $X_{21}$ | $X_{22}$ | $X_{23}$ |
| $X_{30}$ | $X_{31}$ | $X_{32}$ | $X_{33}$ |

**Input Activation**

|          |          |          |          |
|----------|----------|----------|----------|
| $O_{00}$ | $O_{01}$ | $O_{02}$ | $O_{03}$ |
| $O_{10}$ | $O_{11}$ | $O_{12}$ | $O_{13}$ |
| $O_{20}$ | $O_{21}$ | $O_{22}$ | $O_{23}$ |
| $O_{30}$ | $O_{31}$ | $O_{32}$ | $O_{33}$ |

**Output Activation**

$$O_{00} = W_{00} X_{00} + W_{01} X_{01} + W_{02} X_{02} + W_{10} X_{10} + W_{11} X_{11}$$
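For the sparse filter above, only the five nonzero weights need a multiplier. The sketch below computes $O_{00}$ that way; the numeric weight and input values are made up for illustration, and only the zero at position (1, 2) comes from the slide.

```python
# 2x3 filter with one zero weight, as in the slide; the values are hypothetical.
weights = [[1.0, 2.0, 3.0],
           [4.0, 5.0, 0.0]]

# 4x4 input activation with X[r][c] = 4*r + c (hypothetical values).
inputs = [[float(4 * r + c) for c in range(4)] for r in range(4)]

# Keep only nonzero weights: each one occupies a multiplier in the group.
nonzero = [(r, s, w)
           for r, row in enumerate(weights)
           for s, w in enumerate(row)
           if w != 0.0]
vn_size = len(nonzero)

# O_00 is the sum of products over the nonzero weights only.
o_00 = sum(w * inputs[r][s] for r, s, w in nonzero)
```

With these values, `o_00 = 1*0 + 2*1 + 3*2 + 4*4 + 5*5 = 49`, computed with five multiplications instead of six.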

# Outline of Talk

- Motivation
  - Irregular Dataflows
  - DNN Computation
- MAERI
  - Abstraction
  - Implementation
  - Operation Example
  - Mapping Strategies
- Evaluations



# Example Mapping – Dense CNN

Our Key insight: Each DNN/dataflow translates into neurons of different sizes



# Example Mapping – Sparse DNN

Our Key insight: Each DNN/dataflow translates into neurons of different sizes



# Example Mapping – LSTM/FC

Our Key insight: Each DNN/dataflow translates into neurons of different sizes
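The key insight can be made concrete with a little arithmetic: the neuron size fixes how many neurons fit in the multiplier pool at once. The sketch below assumes a 64-multiplier pool and three made-up neuron sizes (a dense 3x3 filter, a pruned filter with 5 nonzero weights, and a 16-input FC neuron). The helper name and all numbers are illustrative assumptions, not figures from the talk.

```python
def vns_and_utilization(num_multipliers, vn_size):
    """How many equal-sized neurons fit, and what fraction of multipliers they use."""
    count = num_multipliers // vn_size
    return count, count * vn_size / num_multipliers


POOL = 64  # hypothetical number of multipliers

dense_cnn = vns_and_utilization(POOL, 3 * 3)  # dense 3x3 filter: 9 weights
sparse_dnn = vns_and_utilization(POOL, 5)     # pruned filter: 5 nonzero weights
fc_layer = vns_and_utilization(POOL, 16)      # FC neuron with 16 inputs
```

Under these assumptions the pool holds 7 dense neurons (63 of 64 multipliers), 12 sparse neurons (60 of 64), or 4 FC neurons (all 64), so a fixed group size would leave multipliers idle for at least two of the three cases.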



# Searching optimal dataflows for MAERI



# Outline of Talk

- Motivation
  - Irregular Dataflows
  - DNN Computation
- MAERI
  - Abstraction
  - Implementation
  - Operation Example
  - Mapping Strategies
- Evaluations



# End-to-End Performance



# Energy with Convolution Layers



\* Normalized to MAERI energy with AlexNet C1

MAERI reduces energy by up to 57% (28% on average) compared to Row-Stationary (dense dataflow), and by 7.1% on average compared to Systolic Array (sparse dataflow)

# Summary of MAERI

- DNN models evolving rapidly
  - Multiple layer types
  - Sparsity Optimizations
  - Myriad dataflows for scheduling and mapping
- MAERI enables dynamic grouping of an arbitrary number of MACs (“Virtual Neurons”) via reconfigurable, non-blocking interconnects, providing
  - Future-proofing against new DNN models and dataflows
  - Near-100% compute unit utilization

# Takeaways



Thank you!

<http://synergy.ece.gatech.edu>