

NOTE:

The following slides are for reference by HotChips 2023 registered attendees only.



IBM NorthPole Neural Inference Machine  
Dr. Dharmendra S. Modha  
[dmodha@us.ibm.com](mailto:dmodha@us.ibm.com)  
IBM Research – Almaden, San Jose, CA  
August 29, 2023

Dharmendra S. Modha\*, Filipp Akopyan†,  
Alexander Andreopoulos†,  
Rathinakumar Appuswamy†, John V. Arthur†,  
Andrew S. Cassidy†, Pallab Datta†,  
Michael V. DeBoer†, Steven K. Esser†,  
Carlos Ortega Otero†, Jun Sawada†, Brian Tabat†,  
Arnon Amir, Deepika Bablani, Peter J. Carlson,  
Myron D. Flickner, Rajamohan Gandhasri,  
Guillaume J. Garreau, Megumi Ito,  
Jennifer L. Klamo, Jeffrey A. Kusnitz,  
Nathaniel J. McClatchey, Jeffrey L. McKinstry,  
Yutaka Nakamura, Tapan K. Nayak,  
William P. Risk, Kai Schleupen, Ben Shaw,  
Jay Sivagnanam, Daniel F. Smith,  
Ignacio Terrizzano, Takanori Ueda



IBM Research

\*Corresponding author. [dmodha@us.ibm.com](mailto:dmodha@us.ibm.com)

†These authors contributed equally to this work.

This material is based upon work supported by the United States Air Force under Contract No. FA8750-19-C-1518. Support from OUSD(R&E) is gratefully acknowledged.

Why NorthPole?  
Energy- and Space-efficiency

[Figure: ResNet-50]

Copyright ©2023 IBM Corporation







Why NorthPole?  
Latency

[Figure: ResNet-50; axis: Latency (ms)]






# Context



# TrueNorth

Timeline: 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023

NorthPole in stealth mode since 2015



## TrueNorth



## NorthPole unveiled

# NorthPole Architecture

## Core-based design

### Vector Matrix Multiplication (VMM)

- 8-, 4-, and 2-bit precision
- 2048, 4096, and 8192 Ops/cycle, respectively
- Mixed precision: the right precision for each layer
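The per-core figures above can be turned into chip-level peak numbers with a little arithmetic. The following is a minimal sketch, assuming the 16×16 core array stated later in the deck and perfect utilization; the helper names `chip_ops_per_cycle` and `ideal_cycles` are hypothetical and are not part of any NorthPole software.

```python
# Per-core VMM throughput from the slide, keyed by operand precision in bits.
VMM_OPS_PER_CYCLE = {8: 2048, 4: 4096, 2: 8192}

# Core count taken from the "16x16 array of cores" statement in the deck.
CORES = 16 * 16


def chip_ops_per_cycle(bits: int) -> int:
    """Peak VMM operations per cycle across the whole core array."""
    return VMM_OPS_PER_CYCLE[bits] * CORES


def ideal_cycles(layer_ops: int, bits: int) -> int:
    """Lower bound on cycles for a layer, assuming perfect utilization."""
    peak = chip_ops_per_cycle(bits)
    return -(-layer_ops // peak)  # ceiling division


for bits in (8, 4, 2):
    print(bits, "bit:", chip_ops_per_cycle(bits), "ops/cycle at peak")
```

Halving the precision doubles the peak throughput, which is why choosing the lowest precision a layer tolerates matters for speed.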



### Vector Compute Unit

- 256 Ops/cycle
- FP16 precision

### Activation Function Unit (not shown)

- 32 Ops/cycle
- FP16 precision

Fully pipelined operation

## Memory-near-compute

Weight buffer  
is near VMM

Partial Sum Buffer  
is near Vector Compute Unit

768KB / core of unified memory

- weights (model)
- program
- neural activations
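As a quick consistency check, the per-core memory multiplied by the 16×16 core array reported elsewhere in the deck gives the chip total of 192 MB. This sketch assumes that KB and MB are binary units (1024-based), which is what makes the numbers line up exactly.

```python
KIB = 1024
MIB = 1024 * KIB

CORES = 16 * 16               # core array size from the deck
MEMORY_PER_CORE = 768 * KIB   # unified memory per core from the slide

total_bytes = CORES * MEMORY_PER_CORE
total_mb = total_bytes // MIB

print(total_mb, "MB of on-chip memory distributed among", CORES, "cores")
```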



## Concurrent, distributed control

Eight threads per core

- No VLIW

Fully prescheduled operation in the core array

- deterministic, predictable, verifiable
- no data-dependent conditional branching (a departure from Turing's idea of conditional branching)
- no cache misses
- no stalls
- no speculative execution



# Compute-intertwined-with-memory



[Figure: schematic]





# Four Networks-on-chip (NoC)

Unify distributed compute, memories

1. Activation NoC (ANoC) used between layers to reorganize neural activations

2. Partial Sum NoC (PNoC) used within a layer for neighboring cores to communicate – spatial computing

3. Model NoC (MNoC) delivers weights during layer execution

4. Instruction NoC (INoC) delivers program for each layer prior to layer start

MNoC/INoC enable reconfigurability – key to bridging brain-inspired computing with silicon



## Dense interconnectivity

4,096 wires criss-cross each core in both dimensions



# Distributed, modular core array

Cortex-like modularity enables homogeneous scalability in two dimensions




# 12nm Silicon Implementation

16×16 array of cores

Massive parallelism

192MB of memory  
for activations, model, program  
distributed among cores

32MB framebuffer for IO tensors

800 mm<sup>2</sup> area, 22 billion transistors

Functional in first-silicon



[Figure: compute and memory in GPU (A100), TPU, CPU (Zen 3), and NorthPole; scale marker: 1 billion transistors]


NorthPole has no centralized memory, no off-chip memory,  
no von Neumann bottleneck



NorthPole has a simple usage/IO model: write tensor, run, read tensor. It is essentially an active memory.

- Entire network is on-chip
- No layer-by-layer interaction
- Minimum load on the host
- Minimum IO bandwidth
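The write, run, read pattern can be illustrated from the host's point of view with a small software stand-in. This is a sketch under stated assumptions: the class `ActiveMemoryDevice` and its method names are hypothetical and do not reflect the actual NorthPole runtime API, and the "network" here is an ordinary Python callable.

```python
from typing import Callable, List, Optional

Tensor = List[float]  # stand-in for a real tensor type


class ActiveMemoryDevice:
    """Hypothetical host-side view of a device driven by three commands."""

    def __init__(self, network: Callable[[Tensor], Tensor]) -> None:
        # The whole network is resident on the device before any input arrives.
        self._network = network
        self._input: Optional[Tensor] = None
        self._output: Optional[Tensor] = None

    def write_tensor(self, tensor: Tensor) -> None:
        """Command 1: place the input tensor on the device."""
        self._input = list(tensor)
        self._output = None

    def run(self) -> None:
        """Command 2: execute the full network with no per-layer host calls."""
        if self._input is None:
            raise RuntimeError("write_tensor must be called before run")
        self._output = self._network(self._input)

    def read_tensor(self) -> Tensor:
        """Command 3: fetch the output tensor."""
        if self._output is None:
            raise RuntimeError("run must be called before read_tensor")
        return list(self._output)


# Toy network (an assumption for illustration): doubles every element.
device = ActiveMemoryDevice(network=lambda xs: [2.0 * x for x in xs])
device.write_tensor([1.0, 2.0, 3.0])
device.run()
result = device.read_tensor()
print(result)
```

Only the input and output tensors cross the host boundary in this pattern, which is consistent with the slide's point about minimum host load and IO bandwidth.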

# Achieving State-of-the-art Accuracy with Mixed Precision

## Quantization-aware training to maximize accuracy via a PyTorch extension

Two algorithms:

- FAQ: Finetuning After Quantization
- LSQ: Learned Step-size Quantization
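For orientation, the forward pass of a step-size quantizer can be sketched in a few lines. This follows the quantizer form given in the published LSQ work (scale by the step size, clip to the integer range, round, rescale); it is not the NorthPole PyTorch extension, and it omits the training part in which the step size is learned by gradient descent. The function name `lsq_quantize` is hypothetical.

```python
import math


def lsq_quantize(v: float, step: float, bits: int, signed: bool = True) -> float:
    """Quantize v to a `bits`-bit grid with spacing `step` (forward pass only)."""
    if signed:
        q_n, q_p = 2 ** (bits - 1), 2 ** (bits - 1) - 1
    else:
        q_n, q_p = 0, 2 ** bits - 1
    scaled = v / step
    clipped = max(-q_n, min(q_p, scaled))
    # Round to nearest integer level (half rounds up; an implementation choice).
    level = math.floor(clipped + 0.5)
    return level * step


# Illustrative values only: 4-bit signed grid with step 0.1.
print(lsq_quantize(0.37, 0.1, 4))   # lands on a nearby grid point
print(lsq_quantize(5.0, 0.1, 4))    # clipped to the top of the range
print(lsq_quantize(-5.0, 0.1, 4))   # clipped to the bottom of the range
```

A larger step size widens the representable range at the cost of coarser resolution, which is the trade-off the learned step size is meant to balance per layer.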

## Selecting network layer precision to maximize accuracy / throughput

Two algorithms:

- EAGL: Entropy Approximation Guided Layer selection
- ALPS: Accuracy-aware Layer Precision Selection
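The slide does not describe how either algorithm works. As an illustration of the kind of signal the EAGL name suggests, the sketch below computes the Shannon entropy of each layer's quantized weight levels and ranks layers by it. This is an assumption for illustration only, not the EAGL algorithm itself; the function names and the toy weight levels are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def level_entropy(levels: Sequence[int]) -> float:
    """Shannon entropy (bits) of the empirical distribution of integer levels."""
    counts = Counter(levels)
    total = len(levels)
    return 0.0 - sum((c / total) * math.log2(c / total) for c in counts.values())


def rank_layers_by_entropy(layers: Dict[str, Sequence[int]]) -> List[str]:
    """Layer names ordered from highest to lowest weight-level entropy."""
    return sorted(layers, key=lambda name: level_entropy(layers[name]), reverse=True)


# Toy quantized weight levels per layer (made-up values for illustration).
layers = {
    "conv1": [0, 1, 2, 3],
    "conv2": [0, 0, 1, 1],
    "fc": [0, 0, 0, 0],
}
ranking = rank_layers_by_entropy(layers)
print(ranking)
```

A measure of this kind needs only the weights, not training runs, which is at least consistent with the very small CPU cost reported for EAGL in the comparison table.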

Demonstrated on classification tasks:

| Network      | Method           | Precision (w,a) | Top-1 accuracy (%) | Top-5 accuracy (%) |
|--------------|------------------|-----------------|--------------------|--------------------|
| ResNet-18    | Apprentice       | 4,8             | 70.40              | 89.32              |
| ResNet-18    | FAQ (This paper) | 8,8             | 70.02              | 89.32              |
| ResNet-18    | FAQ (This paper) | 4,4             | 69.78±0.04         | 89.11±0.03         |
| ResNet-18    | Joint Training   | -               | -                  | -                  |
| ResNet-18    | UNIQ             | 4,8             | 67.02              | -                  |
| ResNet-18    | Distillation     | 4,32            | 64.20              | -                  |
| ResNet-34    | baseline         | 32,32           | 73.30              | 91.42              |
| ResNet-34    | FAQ (This paper) | 8,8             | 73.71              | 91.63              |
| ResNet-34    | FAQ (This paper) | 4,4             | 73.31              | 91.32              |
| ResNet-34    | UNIQ             | 4,32            | 73.1               | -                  |
| ResNet-34    | Apprentice       | 4,8             | 73.1               | -                  |
| ResNet-34    | UNIQ             | 4,8             | 71.09              | -                  |
| ResNet-50    | baseline         | 32,32           | 76.15              | 92.87              |
| ResNet-50    | FAQ (This paper) | 8,8             | 76.52              | 93.09              |
| ResNet-50    | FAQ (This paper) | 4,4             | 76.27              | 92.89              |
| ResNet-50    | EAGL             | 4,4             | 75.9               | 92.4               |
| ResNet-50    | IOA              | 8,8             | 74.9               | -                  |
| ResNet-50    | Apprentice       | 4,8             | 74.7               | -                  |
| ResNet-50    | UNIQ             | 4,8             | 73.37              | -                  |
| ResNet-152   | baseline         | 32,32           | 78.31              | 94.06              |
| ResNet-152   | FAQ (This paper) | 4,4             | 78.64              | 94.12              |
| ResNet-152   | FAQ (This paper) | 8,8             | 78.54              | 94.07              |
| Inception-v3 | baseline         | 32,32           | 77.45              | 93.56              |
| Inception-v3 | FAQ (This paper) | 8,8             | 77.60              | 93.59              |
| Inception-v3 | FAQ (This paper) | 4,4             | 77.33              | 93.59              |
| Inception-v3 | IOA              | 8,8             | 74.2               | 92.2               |
| DenseNet-161 | baseline         | 32,32           | 77.65              | 93.80              |
| DenseNet-161 | FAQ (This paper) | 4,4             | 77.90              | 93.83              |
| DenseNet-161 | FAQ (This paper) | 8,8             | 77.84              | 93.91              |
| VGG-16bn     | baseline         | 32,32           | 73.36              | 91.50              |
| VGG-16bn     | FAQ (This paper) | 4,4             | 73.87              | 91.67              |
| VGG-16bn     | FAQ (This paper) | 8,8             | 73.66              | 91.56              |

Top-1 Accuracy @ Precision:

| Network           | Method     | 2    | 3    | 4    | 8    | Full precision |
|-------------------|------------|------|------|------|------|----------------|
| ResNet-18         |            |      |      |      |      | 70.5           |
| SqueezeNext-23-2x | LSQ (Ours) | 53.3 | 63.7 | 67.4 | 67.0 | 67.3           |



| Method      | ResNet-50        | PSPNet         |
|-------------|------------------|----------------|
| EAGL (Ours) | 3.15 CPU seconds | <1 CPU minute  |
| ALPS (Ours) | 166 GPU hours    | 67 GPU hours   |
| HAWQ-v3     | 2 GPU hours      | 1032 GPU hours |

Low Precision ≠ Low fidelity

# A Growing List of Implementable Networks

Transformer models that a potential multi-chip NorthPole could support. A64 (A128) denotes a model run on NorthPole configured with 64 (128) MB of activation memory and 128 (64) MB of weight memory.

|                               |                         |                       |                                |                          |                        |
|-------------------------------|-------------------------|-----------------------|--------------------------------|--------------------------|------------------------|
| AA. albert-large-v1-2b(A64)   | AN. gpt2-Medium-8b(A64) | BA. m2m100-8b(A64)    | BN. t5-large-4b(A64)           | CA. BertLarge-8b(A128)   | CN. m2m100-8b(A128)    |
| AB. albert-large-v1-4b(A64)   | AO. gpt2-Large-2b(A64)  | BB. mt5-base-2b(A64)  | BO. t5-large-8b(A64)           | CB. gpt2-medium-2b(A128) | CO. mt5-base-2b(A128)  |
| AC. albert-large-v1-8b(A64)   | AP. gpt2-Large-4b(A64)  | BC. mt5-base-4b(A64)  | BP. xglm-2b(A64)               | CC. gpt2-medium-4b(A128) | CP. mt5-base-4b(A128)  |
| AD. albert-xlarge-v1-2b(A64)  | AQ. gpt2-Large-8b(A64)  | BD. mt5-base-8b(A64)  | BQ. xglm-4b(A64)               | CD. gpt2-medium-8b(A128) | CQ. mt5-xl-2b(A128)    |
| AE. albert-xlarge-v1-4b(A64)  | AR. gpt2-XL-2b(A64)     | BE. mt5-large-2b(A64) | BR. xglm-8b(A64)               | CE. gpt2-large-2b(A128)  | CR. t5-base-2b(A128)   |
| AF. albert-xlarge-v1-8b(A64)  | AS. gpt2-XL-4b(A64)     | BF. mt5-large-4b(A64) | BS. albert-large-v1-2b(A128)   | CF. gpt2-large-4b(A128)  | CS. t5-base-4b(A128)   |
| AG. albert-xxlarge-v1-2b(A64) | AT. gpt2-XL-8b(A64)     | BG. mt5-large-8b(A64) | BT. albert-large-v1-4b(A128)   | CG. gpt2-large-8b(A128)  | CT. t5-base-8b(A128)   |
| AH. albert-xxlarge-v1-4b(A64) | AU. gpt-neo-2b(A64)     | BH. mt5-xl-2b(A64)    | BU. albert-large-v1-8b(A128)   | CH. gpt2-xl-2b(A128)     | CU. t5-large-2b(A128)  |
| AI. BertLarge-2b(A64)         | AV. gpt-neo-4b(A64)     | BI. mt5-xl-4b(A64)    | BV. albert-xlarge-v1-2b(A128)  | CI. gpt2-xl-4b(A128)     | CV. t5-large-4b(A128)  |
| AJ. BertLarge-4b(A64)         | AW. gpt-neo-8b(A64)     | BJ. t5-base-2b(A64)   | BW. albert-xlarge-v1-4b(A128)  | CJ. gpt-neo-2b(A128)     | CW. t5-large-8b(A128)  |
| AK. BertLarge-8b(A64)         | AX. llama-2b(A64)       | BK. t5-base-4b(A64)   | BX. albert-xxlarge-v1-2b(A128) | CK. gpt-neo-4b(A128)     | CX. xglm-564M-2b(A128) |
| AL. gpt2-Medium-2b(A64)       | AY. m2m100-2b(A64)      | BL. t5-base-8b(A64)   | BY. BertLarge-2b(A128)         | CL. m2m100-2b(A128)      | CY. xglm-564M-4b(A128) |
| AM. gpt2-Medium-4b(A64)       | AZ. m2m100-4b(A64)      | BM. t5-large-2b(A64)  | BZ. BertLarge-4b(A128)         | CM. m2m100-4b(A128)      | CZ. xglm-564M-8b(A128) |

Networks can be implemented from a large set of possibilities  
as more matrix multiplication primitives are added to the software toolchain

# NorthPole End-to-end Toolchain



**podman**, **OpenShift**

Container preinstalled with SDK

### Train Flow (Offline)



PyTorch API to adapt network for NorthPole and train on GPU



Compiler to export network to hardware-ready model

### Run Flow



Runtime API to deploy model on NorthPole



Validator to emulate NorthPole in software

Example applications with full source code, pretrained networks

# NorthPole Systems



(Research Prototype)

FPGA is only used for PCIe bridge  
No HBM/external memory, No CPU Cores



NorthPole PCIe assembly (Research Prototype)



Single NorthPole assembly in a 1U server (Research Prototype)

To scale-out, a model can be striped across chips,  
increasing FPS and parameter memory  
while keeping energy-, space-, and latency-efficiencies,  
with only low-bandwidth data tensors moving via PCIe



Four NorthPole assemblies in a server (Research Prototype)  
... 8, 10, 12, 16 assemblies in a server are possible



Thank you!

dmodha@us.ibm.com

- NorthPole ...
  - ... is specialized to inference
  - ... performs at the frontier of energy, space, and time
  - ... can support many deep networks in vision, speech, and natural language
  - ... has brain-inspired and silicon-optimized architecture
  - ... has modular, tileable architecture – like the cortex
  - ... has massive parallelism – like the cortex
  - ... has mixed-precision – like the cortex
  - ... has memory-near-compute – like the cortex
  - ... has no off-chip / centralized memory and no von Neumann bottleneck – like the cortex
  - ... has only three commands: write tensor(s), run network, read tensor(s) – is an active memory
  - ... has minimum IO bandwidth requirement
  - ... has minimum load on the host
  - ... has two dense brain-inspired networks-on-chip
  - ... has two dense silicon-optimized networks-on-chip
  - ... has no VLIW
  - ... has pre-scheduled, deterministic operation in the core array free from cache-misses
  - ... has unscheduled, input-driven operation in the framebuffer for queuing and isolation
  - ... has co-designed mixed-precision training algorithms
  - ... has an end-to-end software toolchain
  - ... has a current PCIe implementation with many possible custom boards
  - ... has an easy scale-out implementation
  - ... has significant headroom in terms of system scaling, silicon scaling, architecture innovations

# Notes on BERT-base Performance Comparison

1. Comparative approaches use a sequence length of 128. Their performance metrics are scaled by a factor of 3x to reflect the compute required for a sequence length of 384.
2. The 3x scaling factor from a sequence length of 128 to 384 was validated against A100 GPU performance numbers on BERT-large, which are reported for both sequence lengths.
3. This is a reasonable and conservative upper bound, as the compute and communication required by the network scale by a factor of 3x.
4. It does not account for the fact that the longer sequence network may not fit in chip memories sized for the shorter sequence length, or similar caching effects. This would lead to scaling worse than a factor of 3x.
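The scaling in the notes above amounts to a single ratio. The sketch below assumes that "scaled by 3x" means throughput is divided by the sequence-length ratio and latency is multiplied by it; the notes do not spell out the direction per metric, so treat that as an assumption. The function name and the input numbers are hypothetical.

```python
from typing import Tuple


def scale_to_longer_sequence(
    throughput_short: float,
    latency_short: float,
    short_len: int = 128,
    long_len: int = 384,
) -> Tuple[float, float]:
    """Rescale metrics measured at short_len to an estimate at long_len.

    Assumes compute and communication grow linearly with sequence length,
    as stated in the notes; memory-capacity and caching effects are ignored.
    """
    factor = long_len / short_len
    return throughput_short / factor, latency_short * factor


# Made-up measurements at sequence length 128, for illustration only.
throughput_384, latency_384 = scale_to_longer_sequence(3000.0, 2.0)
print(throughput_384, latency_384)
```

Because effects such as the longer-sequence network not fitting in chip memory are ignored, the scaled throughput is an optimistic estimate for the comparative approaches, which matches the "conservative upper bound" framing in the notes.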