



# **Introduction to FPGA, FINN and Brevitas**

**Dr. Mario Ruiz**  
**AMD University Program**

# AUP Vision

Empower academics with AMD technology to enhance teaching and learning experiences and advance state-of-the-art research.

# Our Team

Dedicated world-wide  
technical team

Supporting High Performance  
and Adaptive Compute

25+ years experience  
working with academia



# What We Offer



**Research  
Programs**



**Donation  
Program**



**Teaching  
Resources**



**Training**



**Academic  
Solutions**



**Support**

# HACCs: Heterogeneous Accelerated Compute Clusters

Remote access to  
Adaptive Compute hardware

HACC user group meetings

Access to AMD researchers

Collaboration opportunities



**AMD**  
**EPYC**

**AMD**  
**INSTINCT**

**AMD**  
**ALVEO**

**AMD**  
**VERSAL**

[www.amd-haccs.io](http://www.amd-haccs.io)

★ Newest HACC at IISc, Bangalore

# HACC Adaptive Computing Hardware



- HACC hardware consists of:
  - Compute and Alveo™ nodes (initially U250 and U280 with HBM)
  - Latest heterogeneous nodes (SMC 4124GS) include:
    - 2 EPYC™ 3rd generation CPUs
    - 4 AMD Instinct™ MI210 GPUs
    - 2 Alveo U55C FPGAs with HBM
    - 2 VCK5000 Versal Adaptive SoCs with AIEs
    - Run-time via AMD ROCm™, XRT
    - SW development via HIP, Vitis, frameworks
  - 100G network
- Community hub for researchers
  - Support from in-house AMD research groups
  - Reproducible results & experiments



# Contact Us

## Visit our website to:

- Discover our research programs
- Access educational resources
- Submit a donation request
- Find training & other events



## Email us:

[aup@amd.com](mailto:aup@amd.com)

*Screenshot: AMD University Program landing page (www.amd.com/university), an educator, researcher, and student hub for AMD resources, programs, and news.*

[www.amd.com/AUP](http://www.amd.com/AUP)

# What is Adaptive Computing?

## Optimize for the Workload

Domain-Specific Architecture for your exact requirements, accelerating the whole application

## Adapt as Algorithms Change

Re-implement the silicon after deployment, adapting to evolving use cases

## Accelerate Pace of Innovation

Keep pace with fast moving markets and rapid innovation cycles, e.g., AI algorithms

Adaptive Hardware ("FPGA")  
Conceptual Representation



Matching the Architecture to the Application

*Custom Data Flow, Custom Memory Hierarchy, Custom Precision*



# Evolution to Heterogeneous Platforms

- From FPGAs to adaptive SoCs → matching the engine to the workload
- Balancing diverse technologies for domain-specific requirements

## Domain Specific Optimization



# Field Programmable Gate Array (FPGA)

- Semiconductor devices
- Programmed and reprogrammed by a user
  - Configuration attributes manipulated after manufacturing
  - Matrix of configurable logic blocks (CLBs)
  - Dedicated specialized logic
  - Flexible programmable interconnects
- Ideal fit for many different workloads
  - Massive parallelism
- Hardware adaptability is a unique differentiator from CPUs and GPUs
- Invented in 1985

## Applications

- Automotive
- Broadcast & Pro AV
- Consumer Electronics
- Data Center
- High Performance Computing and Data Storage
- Industrial
- Medical
- Video & Image Processing
- Wired Communications
- Wireless Communications

# Core Adaptable Hardware Technologies



## FPGAs

From high-bandwidth connectivity to massive compute engines

AMD SPARTAN    AMD ARTIX    AMD KINTEX    AMD VIRTEX

## SoCs

Multi-processing subsystem with Arm® cores and integrated FPGA logic

AMD ZYNQ

## Adaptive SoCs

Adaptive Compute Acceleration Platforms for any application, any developer

AMD VERSAL

# Three Ages of FPGAs

- A Retrospective on the First Thirty Years of FPGA Technology
- S. M. Trimberger, "Three Ages of FPGAs: A Retrospective on the First Thirty Years of FPGA Technology," in Proceedings of the IEEE, vol. 103, no. 3, pp. 318-331, March 2015, DOI: 10.1109/JPROC.2015.2392104

**INVITED PAPER**

## Three Ages of FPGAs: A Retrospective on the First Thirty Years of FPGA Technology

This paper reflects on how Moore's Law has driven the design of FPGAs through three epochs: the age of invention, the age of expansion, and the age of accumulation.

By STEPHEN M. (STEVE) TRIMBERGER, Fellow IEEE

**ABSTRACT** Since their introduction, field programmable gate arrays (FPGAs) have grown in capacity by more than a factor of 10 000 and in performance by a factor of 100. Cost and energy per operation have both decreased by more than a factor of 1000. These advances have been fueled by process technology scaling, but the FPGA story is more complex than simple technology scaling. The unique effects of Moore's Law have driven qualitative changes in FPGA architecture, applications and tools. As a consequence, FPGAs have passed through several distinct phases of development. These phases, termed "Ages" in this paper, are The Age of Invention, The Age of Expansion and The Age of Accumulation. This paper summarizes each and discusses their driving pressures and fundamental characteristics. The paper concludes with a vision of the upcoming Age of FPGAs.

**KEYWORDS** Application-specific integrated circuit (ASIC); commercialization; economics of scale; field-programmable gate array (FPGA); industrial economics; Moore's Law; programmable logic

**I. INTRODUCTION**

Xilinx introduced the first field programmable gate arrays (FPGAs) in 1984, though they were not called FPGAs until Altera popularized the term around 1988. Over the ensuing 30 years, the device we call an FPGA increased in capacity by more than a factor of 10 000 and increased in speed by a factor of 100. Cost and energy consumption per unit function decreased by more than a factor of 1000 (see Fig. 1).

Manuscript received December 18, 2014; revised November 21, 2014 and December 11, 2014; accepted December 11, 2014. Date of current version April 14, 2015. The author is with Xilinx, San Jose, CA 95124 USA (e-mail: trimberger@xilinx.com). Digital Object Identifier: 10.1109/JPROC.2015.2392104.




# FPGA: 7-Series Architecture

- Logic elements distributed on regular columns
  - Scalability from low-cost to high-performance
- High-speed IO
- Clock management
- Interconnect matrix
  - Routing resources



Artix-7 Architecture Overview

# Configurable Logic Block (CLB)

- Primary resource for design in AMD FPGAs
  - Combinatorial functions
  - Flip-flops
- CLB contains two slices
- Connected to switch matrix for routing to other FPGA resources
  - Carry chain runs vertically



# Two Types of CLB Slices

- SLICEM: Full slice
  - Can be used for logic, memory and shift register LUT
  - Has wide multiplexers and carry chain
- SLICEL: Logic and arithmetic only
  - LUT can only be used for logic (not memory)
  - Has wide multiplexers and carry chain



# Slice Resource

- Four six-input Look-Up Tables (LUT)
- Multiplexers
- Carry chains
- Four flip-flops/latches
  - Four additional flip-flops
- The implementation tool will pack multiple slices in the same CLB if certain rules are followed



*Figure source: UG474 (7 Series FPGAs Configurable Logic Block User Guide)*

# 6-Input LUT with Dual Output

- Each LUT can be configured as two 5-input LUTs with shared inputs
  - Minimal speed impact to a 6-input LUT
  - One or two outputs
- Any combinatorial function of six variables or two functions of five variables
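
As a rough behavioural illustration (not taken from the slides), a 6-input LUT can be modelled as a 64-entry truth table indexed by its inputs. In dual-output mode, the lower and upper 32 entries act as two 5-input functions that share the same five inputs. The helper names `make_lut`, `lut_read`, and `dual_read` are hypothetical, and the sketch assumes the sixth input is tied high when both outputs are used.

```python
def make_lut(func, n_inputs):
    """Build the truth table (one bit per input combination) for `func`."""
    table = 0
    for idx in range(1 << n_inputs):
        bits = [(idx >> i) & 1 for i in range(n_inputs)]  # bits[0] is input 0
        if func(*bits):
            table |= 1 << idx
    return table


def lut_read(table, inputs):
    """Look up the output bit for a list of input bits (inputs[0] is input 0)."""
    idx = sum(bit << i for i, bit in enumerate(inputs))
    return (table >> idx) & 1


# One function of six variables: parity (XOR) of all inputs.
parity6 = make_lut(lambda a, b, c, d, e, f: a ^ b ^ c ^ d ^ e ^ f, 6)

# Two functions of five variables packed into one 64-bit table:
# AND in the lower 32 entries, OR in the upper 32 entries.
and5 = make_lut(lambda a, b, c, d, e: a & b & c & d & e, 5)
or5 = make_lut(lambda a, b, c, d, e: a | b | c | d | e, 5)
dual = (or5 << 32) | and5


def dual_read(table, inputs5):
    """Return (lower-half output, upper-half output) for five shared inputs."""
    low = lut_read(table & 0xFFFFFFFF, inputs5)
    high = lut_read(table, inputs5 + [1])  # sixth input tied to 1
    return low, high


print(lut_read(parity6, [1, 0, 1, 1, 0, 0]))
print(dual_read(dual, [1, 0, 0, 0, 0]))
```

The same 64 bits of storage therefore serve either one arbitrary 6-input function or two 5-input functions, which is the trade-off the bullets describe.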



# Slice Flip-Flops and Flip-Flop/Latches

- Each slice has four flip-flop/latches (FF/L)
  - Can be configured as either flip-flops or latches
- Each slice also has four flip-flops (FF)



# Slice Flip-Flop Capabilities

- All flip-flops are D type
  - Q output
- All flip-flops have a single clock input (CK)
- All flip-flops have an active-high clock enable (CE)
- All flip-flops have an active high SR input
  - Input can be synchronous or asynchronous
  - Sets the flip-flop value to a pre-determined state
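
The capabilities listed above can be sketched as a small behavioural model. This is only an illustration: the class name `SliceFlipFlop` and its methods are hypothetical, and the model assumes that SR takes priority over CE.

```python
class SliceFlipFlop:
    """Behavioural model of a D flip-flop with clock enable and set/reset."""

    def __init__(self, sr_value=0, sr_async=False, init=0):
        self.sr_value = sr_value  # state loaded when SR is asserted
        self.sr_async = sr_async  # True: SR acts without waiting for a clock edge
        self.q = init

    def apply_async_sr(self, sr):
        """Apply SR between clock edges; only has an effect in asynchronous mode."""
        if self.sr_async and sr:
            self.q = self.sr_value
        return self.q

    def rising_edge(self, d, ce=1, sr=0):
        """Update Q on a rising clock edge (assumption: SR overrides CE)."""
        if sr:
            self.q = self.sr_value
        elif ce:
            self.q = d
        return self.q


ff = SliceFlipFlop(sr_value=0)
captured = ff.rising_edge(d=1)          # CE high: D is captured
held = ff.rising_edge(d=0, ce=0)        # CE low: previous value is held
cleared = ff.rising_edge(d=1, sr=1)     # SR asserted: pre-determined state
print(captured, held, cleared)
```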



*Figure source: UG474 (7 Series FPGAs Configurable Logic Block User Guide)*

# 7-Series FPGA I/O

- Wide range of voltages
  - 1.2V to 3.3V operation
- Wide I/O standards support
  - Single ended and differential
  - Referenced voltage inputs
  - 3-state capability
- Very high performance
  - Up to 1600 Mbps LVDS
  - Up to 1866 Mbps single-ended for DDR3
- Easy memory interfacing
  - Hardware support for QDRII+ and DDR3
- Digitally controlled impedance
- Power reduction features



# 7-Series Block RAM and FIFO

- Fully synchronous operation
  - Outputs are latched
- Optional internal pipeline register
  - Higher frequency operation
- Two independent ports access common data
  - Individual address, clock, write enable, clock enable
  - Independent data widths for each port



# 7-Series Block RAM and FIFO

- Multiple configuration options
  - True dual-port, simple dual-port, single-port
- Integrated cascade logic
- Byte-write enable in wider configurations
- Integrated control for fast and efficient FIFOs
- Integrated 64/72-bit Hamming error correction



# 7-Series DSP48E1 Slice



# 7-Series FPGAs Clock Management

- Global clock buffers
  - High fanout clock distribution buffer
- Low-skew clock distribution
  - Regional clock routing
- Clock regions
  - Each clock region is 50 CLBs high and spans half the device
- Clock management tile (CMT)
  - One Mixed-Mode Clock Manager (MMCM) and one Phase-Locked Loop (PLL) in each CMT
  - Performs frequency synthesis, clock de-skew, and jitter-filtering
  - High input frequency range



# Programming Model

## Hardware Description Languages (HDL)

- Verilog
- VHDL
- SystemVerilog
- Closer to the metal
  - Low level abstraction
  - Describe the behaviour



## High-Level Synthesis (HLS)

- C/C++
- High level of abstraction
  - Write algorithms
- Vitis HLS generates the architecture
  - Guided by user directives



# VHDL/Verilog counter

## VHDL

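```
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;  -- standard arithmetic package

entity counter is
    Port ( clk  : in  std_logic;
           rst  : in  std_logic;                    -- synchronous, active-high reset
           cout : out std_logic_vector(3 downto 0)  -- 4-bit count value
         );
end counter;

architecture rtl of counter is
    signal counter_up : unsigned(3 downto 0) := (others => '0');
begin
    -- Count up on every rising clock edge; wraps from 15 to 0.
    process(clk)
    begin
        if rising_edge(clk) then
            if rst = '1' then
                counter_up <= (others => '0');
            else
                counter_up <= counter_up + 1;
            end if;
        end if;
    end process;

    -- Drive the output port concurrently, outside the clocked process.
    cout <= std_logic_vector(counter_up);
end rtl;
```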

## Verilog

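```
module counter(
    input            clk,
    input            rst,    // synchronous, active-high reset
    output reg [7:0] count   // 8-bit count value
);

    // Count up on every rising clock edge; wraps from 255 to 0.
    always @(posedge clk) begin
        if (rst)
            count <= 8'd0;
        else
            count <= count + 8'd1;
    end
endmodule
```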

# Vitis HLS Vector addition

```
void vadd(const int* in1, // Read-Only Vector 1
          const int* in2, // Read-Only Vector 2
          int* out,       // Output Result
          int elements    // Number of elements
        ) {
// Simple vector addition kernel.
vadd1:
    for (int i = 0; i < elements; i++) {
        out[i] = in1[i] + in2[i];
    }
}
```

# What is AMD Vitis™ HLS and HLS Benefits

```
#include <stdio.h>
int main () {
    int a;
    int b;
    /* for loop execution */
    for( a = 1; a < 6; a++ )
    {
        /* for loop execution */
        for( b = 1; b <= a; b++ )
        {
            printf("%d\n", b);
        }
        printf("\n");
    }
    return 0;
}
```

Structured C/C++



Automated C/C++ to RTL Conversion



Allows Significantly Faster Design Iterations



Significantly Accelerates Simulation – Important  
For Wireless, Video Applications

RTL Code

# **AI on FPGA**

# DNNs and their Potential

1. Requires little domain expertise
2. NNs are “universal function approximators”
3. If you make it big enough and train it long enough
  - Can outperform humans and existing algorithms on specific tasks



**Will not only increasingly replace other algorithms, but also...**



Nature, Oct 2021



**... solve previously unsolved problems**

- ChatGPT, Copilot
- Stable diffusion
- Protein folding



*Stable Diffusion Prompt: "Pencil sketch of an international group of semiconductor research scientists, studio Ghibli"*

# Spectrum of ML use cases with very different requirements



infps = inferences per second

# DNN Compute Requirements are Outpacing Moore's Law



Source: <https://blog.openai.com/ai-and-compute>

**Innovation is needed to provide the necessary performance scalability**

# Specialization Is #1 Industry Approach to Achieve Performance Scalability and Energy Efficiency

Examples (vendor logos): Graphcore, Habana, Wave Computing, Movidius



# Adaptive Computing or Dedicated Silicon for DPUs



- With increasing specialization of the device, potential sales volume decreases
  - Hard to amortize the increasing NRE costs involved in building ASSPs
  - FPGAs become more attractive
- Increasing specialization scales performance for both ASSPs and FPGAs

**The opportunity for FPGAs lies in their ability to specialize**

ASSP: Application-Specific Standard Product

# **Vitis AI - ML in general**

# Customization levels on Adaptive Computing



# Popular Approach: Matrix of Processing Engines (MPEs) Specializing for AI in general

- Popular layer-by-layer compute
- Batching to achieve high compute efficiency
  - At latency cost (latency ~ batch size)
- Specialized processing engines
  - Operators
  - ALU types
    - tensor-, matrix- or vector-based

**Customized for ML in general**

- Designed to run **any DNN**
- Works really well for computer vision and natural language processing (tens of kinfps)
- Popular approach: Vitis AI (FPGA or AIE) as well as majority of AI accelerators



# AMD Vitis™ AI Integrated Development Environment

*A Complete AI Stack for Adaptable AMD Targets*

## Vitis™ AI Tools & Components



## Domain-Specific Architectures

Embedded Deep Learning Processing Units

Data Center Deep Learning Processing Units

## Supported AMD Targets



# AI Model Zoo – Expanding to Diverse AI Applications

- A comprehensive AI model repository
  - Open and free to download for any user
  - State-of-the-art models from PyTorch, TensorFlow and TensorFlow 2
  - Retrainable and applicable to various datasets and scenarios
  - Deployable on AMD FPGA and Versal Adaptive SoC
- New models in each release



# Extensive Application Coverage

## Classification



- Inception
- Mobilenet
- Resnet
- VGG
- EfficientNet
- MLPerf ResNet50
- OFA ResNet
- Vision Transformer
- Car Type classification
- Car Color classification

## Detection



- ssd\_mobilenet
- Yolov3
- Yolov4
- YoloX
- Refinedet
- EfficientDet
- Pointpillars
- Centerpoint
- CLOCs
- Pointpainting
- Multi-taskv3
- OFA-Yolo

## Segmentation



- ENet
- Semantic FPN
- Salsanext
- Salsanextv2
- SOLO
- Mobilenetv2
- 2D-Unet
- FPN-ResNet18
- Unet-Chaos-CT
- HardNet
- Sa-Gate

## Video Analytics



- Face Recognition
- Face Quality
- Face ReID
- Person ReID
- FairMOT
- FaceMask Detection
- MoveNet

## Industrial Vision/Robotics



- FADNet
- PSMNet
- PMG
- Superpoint
- HFNet

## Medical Image



- RCAN
- SESR
- OFA-RCAN
- DRUnet
- SSR
- C2D2lite

## NLP



- Bert-base
- Sentiment detection
- Customer satisfaction
- Open-information-extraction

## Text-OCR



- Textmountain, OCR

# Compiling for DPU - an XIR-based Toolchain

- Xilinx Intermediate Representation (XIR)
  - Graph-based intermediate representation of the AI algorithms
  - Designed for compilation and efficient deployment of the DPU on the FPGA platform.
- XIR-based compilation flow
  - First, transform the input model into XIR format
  - Then, partition the computation graph into subgraphs
  - Finally, compile each DPU subgraph into an xmodel file



Techniques for Further  
Specialization with  
**Adaptive Compute Architectures**

# Specialization beyond MPEs



# Dataflow - Specializing for Individual Topologies

- Hardware instantiates the topology as a dataflow architecture
  - Customize everything to the **specifics of the given DNN**, any operation, any connectivity
- Benefits:
  - Improved efficiency
  - Low fixed latency
- Scale performance & resources to meet the application requirements
  - If resources allow, the design can be unfolded completely into a circuit that infers at clock speed, meeting even very high throughput requirements



**Dataflow can scale performance to meet the application requirements**



# Specialization beyond MPEs



# Customizing Arithmetic to Minimum Precision

- Popular approach which reduces bits in the data representation of weights and activations while preserving accuracy
- Reducing precision shrinks hardware cost and scales performance
  - Instantiate n times more compute within the same fabric, thereby scaling performance n times
- Reduces memory footprint
  - NN model can stay on-chip => no memory bottlenecks
- With dataflow, every layer has dedicated compute resources, so precision can be mixed and matched across layers
  - Exploits custom arithmetic to a greater degree than MPEs

**Reducing precision saves resources, scales performance, and reduces memory footprint**

**However, it requires quantization support in the training software**



$C = f(\text{size of accumulator}, \text{size of weight}, \text{size of activation})$

| Precision | Model size [MB] (ResNet50) |
|-----------|----------------------------|
| 1b        | 3.2                        |
| 8b        | 25.5                       |
| 32b       | 102.5                      |
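
The table values follow from simple arithmetic: weight storage is the parameter count times the bits per weight. The sketch below assumes 1 MB = 10^6 bytes and derives the parameter count from the 32-bit row rather than from an external source; the function name `model_size_mb` is hypothetical.

```python
BYTES_PER_MB = 1e6  # assumption: decimal megabytes

# Parameter count implied by the 32-bit row: 102.5 MB at 4 bytes per weight.
NUM_PARAMS = 102.5 * BYTES_PER_MB / 4


def model_size_mb(bits_per_weight, num_params=NUM_PARAMS):
    """Weight storage in MB for a given precision."""
    return num_params * bits_per_weight / 8 / BYTES_PER_MB


for bits in (1, 2, 8, 32):
    print(f"{bits:>2}b: {model_size_mb(bits):6.1f} MB")
```

Under these assumptions the 1-bit and 32-bit rows are reproduced directly, and the 8-bit result lands within rounding distance of the tabulated 25.5 MB.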



# Specialization beyond MPEs



# Sparsity

- DNNs are naturally sparse
- Sparse topologies result in irregular compute patterns which are difficult to accelerate on vector- or matrix-based execution units
- With streaming dataflow architectures, where every neuron and synapse is represented in the hardware, we can fully exploit this



FPGA

Optimized  
Dataflow  
on FPGA

## Taking it to the Extreme: **LogicNets**

# Specialization beyond MPEs



# LogicNets with Adaptive Computing



**Traditional**



**LogicNets**



Maximum performance by design (classification at clock rate) [5]  
Compared to unrolled DF: sparse to suit the interconnect

# Unique Opportunity for Adaptive Computing

- FPGAs can scale DNN performance through extreme specialization
- Reduced precision arithmetic
  - Arbitrary bitwidth
  - Mix & match bitwidths between layers
- Fine-grained sparsity
- Scalable, layer-parallel streaming dataflow



**How much do we get out of the different specializations?**

# Deep Network Intrusion Detection System (NIDS)



**Goal:** Implement an **NN-based traffic classifier** delivering 100G **line-rate** throughput (≈ 150 million inferences per second)  
Latency sensitive (buffer 10s of MB/msec)
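
One plausible way to arrive at figures of this size is sketched below. It assumes one classification per packet, minimum-size 64-byte Ethernet frames, and 20 bytes of per-frame overhead on the wire (preamble plus inter-frame gap). These assumptions are not stated on the slide, and the function names are hypothetical.

```python
LINE_RATE_BPS = 100e9     # 100 Gb/s link
MIN_FRAME_BYTES = 64      # minimum Ethernet frame size
OVERHEAD_BYTES = 8 + 12   # preamble/SFD + inter-frame gap


def packets_per_second(frame_bytes=MIN_FRAME_BYTES):
    """Worst-case packet rate at line rate for a given frame size."""
    bits_on_wire = (frame_bytes + OVERHEAD_BYTES) * 8
    return LINE_RATE_BPS / bits_on_wire


def buffer_mb_per_ms():
    """Data that must be buffered for every millisecond of added latency."""
    return LINE_RATE_BPS / 8 / 1e6 / 1e3


print(f"{packets_per_second() / 1e6:.1f} M packets/s")
print(f"{buffer_mb_per_ms():.1f} MB per ms of latency")
```

With these inputs the packet rate comes out just under 150 million per second, and each millisecond of classifier latency corresponds to roughly 12.5 MB of buffering.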

Dataset: UNSW-NB15 (modulated, noisy, and unbalanced). N. Moustafa and J. Slay, "UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set)," 2015 Military Communications and Information Systems Conference (MilCIS), IEEE, 2015.

# Results – Implementations

## Specialization



# Results – Throughput and Latency



# Resource Cost - Compute, Memory

Efficiency



# Deep Network Intrusion Detection System (NIDS) Results

- This example illustrates the trade-offs between specialization and performance and efficiency
- Custom arithmetic is effective at **scaling performance**, and dataflow at **reducing latency**
  - If the application is amenable, custom arithmetic can meet extreme throughput requirements such as in NIDS
- Reduced precision, fine-grained sparsity and learned circuits can **shrink the resource requirements** despite the speedup
- These are some of the opportunities which make most sense to exploit with FPGAs

# **General Introduction to FINN**



# **Project Mission and Key Techniques**



# FINN – Project Mission



- Custom Specialization
  - for creating **high-throughput, ultra-low-latency** DNN inference engines
- End-to-End
  - flow for the **easy** creation of **specialized hardware architectures** for FPGAs
- Open Source
  - for full **transparency and flexibility** to adapt to end user applications and
  - for easy customer interactions

# Two Key Techniques for Customization in FINN

## Streaming Dataflow Architectures for FPGAs



## Custom Precision: Few-bit Weights and Activations



# Customized Dataflow Processing versus More Generic Architectures



# Matrix of Processing Engines (MPEs) Specializing for AI in General

- Popular layer-by-layer compute
- Batching to achieve high compute efficiency
  - At latency cost (latency ~ batch size)
- Customized for ML in general
  - Designed to run any DNN
  - Specialized processing engines
    - Operators
    - ALU types
- Works really well for computer vision and natural language processing
- Popular approach: Vitis AI (FPGA or AIE) as well as majority of AI accelerators



# Dataflow - Specializing for Individual Topologies

- Hardware instantiates the topology as a dataflow architecture
- Customize everything to the specifics of the given DNN, any operation, any connectivity
- Benefits
  - Improved efficiency
  - Low fixed latency
- Scale performance and resources to meet the application requirements



**Dataflow can scale performance to meet the application requirements**



# Dataflow Processing:

## *Scaling to Meet Performance and Resource Requirements*



1. Scale performance and resources to meet the application requirements
2. If resources allow, we unfold completely, creating a circuit for inference at clock speed

# Customized Dataflow Processing versus More Generic Architectures

**Matrix of Processing Engines (MPE)**  
(Vitis AI, TPUs, GPUs)



- Customized for typical DNN operations
  - e.g., multiply accumulate
- Lower throughput (~10KRps)
- Flexibility through programming
- Applications: CV, Speech

**Dataflow Architectures with FPGAs and FINN**



- Customized/adapted for specific DNN topologies
- Streaming interfaces
- Specialization -> higher efficiency
- Lower latency (no intermediate buffering)
- Higher throughput (~100MRps)
- Flexibility through reconfiguration
- Applications: radio, networking, material science, particle physics – smaller DNNs

# Quantization

- Reducing precision shrinks hardware cost/scales performance
  - For integer datatypes, LUT cost scales with both the weight and the activation bit widths (e.g., INT8 : INT1  $\approx$  70x)
  - n-times more compute fits into the same fabric, thereby, scaling performance n-times or shrinking hardware cost accordingly
- Energy
  - Faster execution or smaller footprint  $\rightarrow$  less energy ( $E = P \cdot time$ )
  - Using reduced precision operators saves energy
  - Reduces memory footprint
    - ResNet50 @ 32b: 102.5 MB, ResNet50 @ 2b: 6.4 MB
    - NN model can stay on-chip  $\rightarrow$  no external memory access  $\rightarrow$  saves energy
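
A first-order way to reason about the cost bullet is to assume that multiplier LUT cost grows with the product of the weight and activation bit widths. This product model is an assumption made here for illustration, not a formula from the slides; it gives 64x for INT8 versus 1-bit, in the same range as the quoted ≈ 70x. The function names are hypothetical.

```python
def relative_lut_cost(weight_bits, act_bits):
    """First-order cost model: proportional to the product of the bit widths."""
    return weight_bits * act_bits


def compute_gain(from_bits, to_bits):
    """How many times more multipliers fit when both operands shrink."""
    return relative_lut_cost(from_bits, from_bits) / relative_lut_cost(to_bits, to_bits)


print(compute_gain(8, 1))  # INT8 versus 1-bit
print(compute_gain(8, 2))  # INT8 versus 2-bit
```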

| Precision | Model size [MB] (ResNet50) |
|-----------|----------------------------|
| 1b        | 3.2                           |
| 8b        | 25.5                          |
| 32b       | 102.5                         |



| Operation            | Type                   | 45 nm [pJ]        | 7 nm [pJ]            | 45 nm / 7 nm |
|----------------------|------------------------|-------------------|----------------------|--------------|
| +                    | Int 8                  | 0.03              | 0.007                | 4.3          |
|                      | Int 32                 | 0.1               | 0.03                 | 3.3          |
|                      | BFloat 16              | --                | 0.11                 | --           |
|                      | IEEE FP 16             | 0.4               | 0.16                 | 2.5          |
|                      | IEEE FP 32             | 0.9               | 0.38                 | 2.4          |
| ×                    | Int 8                  | 0.2               | 0.07                 | 2.9          |
|                      | Int 32                 | 3.1               | 1.48                 | 2.1          |
|                      | BFloat 16              | --                | 0.21                 | --           |
|                      | IEEE FP 16             | 1.1               | 0.34                 | 3.2          |
|                      | IEEE FP 32             | 3.7               | 1.31                 | 2.8          |
| SRAM                 | 8 KB SRAM              | 10                | 7.5                  | 1.3          |
|                      | 32 KB SRAM             | 20                | 8.5                  | 2.4          |
|                      | 1 MB SRAM <sup>1</sup> | 100               | 14                   | 7.1          |
| GeoMean <sup>1</sup> |                        |                   |                      |              |
| DRAM                 |                        | Circa 45 nm       | Circa 7 nm           |              |
|                      | DDR3/4                 | 1300 <sup>2</sup> | 1300 <sup>2</sup>    | 1.0          |
|                      | HBM2                   | --                | 250-450 <sup>2</sup> | --           |
|                      | GDDR6                  | --                | 350-480 <sup>2</sup> | --           |

Energy per operation in picojoules; memory is pJ per 64-bit access.
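
To make the energy argument concrete, the sketch below combines a few 7 nm entries from the table above. It treats one multiply-accumulate as one multiply plus one add of the same type, which is a simplification (real accumulators are usually wider), and the names are hypothetical.

```python
# 7 nm energy per operation in pJ, copied from the table above.
ADD_PJ = {"int8": 0.007, "fp32": 0.38}
MUL_PJ = {"int8": 0.07, "fp32": 1.31}
SRAM_1MB_PJ = 14   # per 64-bit access
DDR_PJ = 1300      # per 64-bit access


def mac_energy_pj(dtype):
    """Energy of one multiply plus one add of the same datatype."""
    return MUL_PJ[dtype] + ADD_PJ[dtype]


mac_ratio = mac_energy_pj("fp32") / mac_energy_pj("int8")
mem_ratio = DDR_PJ / SRAM_1MB_PJ

print(f"FP32 MAC vs INT8 MAC: {mac_ratio:.0f}x")
print(f"DDR access vs 1 MB SRAM access: {mem_ratio:.0f}x")
```

Both ratios point the same way as the bullets: narrower operators cost far less energy per operation, and keeping the model in on-chip memory avoids the much more expensive external accesses.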

# The FINN Framework



# FINN Framework: From DNN to FPGA Deployment



## Brevitas

Training in PyTorch  
Algorithmic optimizations

- Train or even learn reduced precision DNNs
  - Library of standard layers
  - Pretrained examples

## FINN Compiler

Hardware Architecture  
Build

- Perform optimizations
  - Assemble parameterized HLS/RTL modules
  - Generate a DNN hardware IP

## Deployment

- Embed the DNN IP into an infrastructure design
  - Generate a Python run-time
  - Enable integration with your application
  - System integration available for some embedded and Alveo platforms, including HACC

# Brevitas:

## *A PyTorch Library for Quantization-Aware Training*
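
Quantization-aware training places quantizers in the forward pass so that the network learns with reduced-precision weights and activations. The sketch below shows only the underlying arithmetic for a uniform signed quantizer with a single scale factor. It is plain Python and does not use the Brevitas API; the function names are hypothetical.

```python
def quantize(values, bit_width):
    """Map floats to signed `bit_width`-bit integers sharing one scale factor."""
    if bit_width < 2:
        raise ValueError("this sketch needs at least 2 bits for a signed range")
    q_min = -(1 << (bit_width - 1))
    q_max = (1 << (bit_width - 1)) - 1
    max_abs = max(abs(v) for v in values) or 1.0  # avoid a zero scale
    scale = max_abs / q_max
    ints = [min(q_max, max(q_min, round(v / scale))) for v in values]
    return ints, scale


def dequantize(ints, scale):
    """Recover the floating-point values represented by the integers."""
    return [i * scale for i in ints]


weights = [0.8, -0.35, 0.1, -0.9]
ints, scale = quantize(weights, bit_width=4)
approx = dequantize(ints, scale)
print(ints)
print([round(a, 3) for a in approx])
```

During training, a library such as Brevitas additionally needs a way to pass gradients through the rounding step; that part is outside the scope of this sketch.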



# FINN Compiler

## *Transform DNN into Custom Dataflow Architecture*

QONNX representation of the quantized DNN



### FINN

- Uses an ONNX-based network description as intermediate representation (IR)
- Is a Python library of graph transformations
- Generates a synthesizable description of each layer (HLS/RTL) encapsulated as an IP block
- Produces a synthesized stitched IP block representing the complete network



Stitched DNN accelerator IP

# FINN Compiler - Network preparation



## QONNX

- Directly exported from Brevitas
- Input format to FINN compiler
- Quantization operator(s)
  - Quant, BipolarQuant, Trunc
- No tensor annotations



## FINN-ONNX

- Previously used as input format
- IR in the FINN compiler
- MultiThreshold to represent activation quantization
- Custom datatype annotations on tensor
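
The idea behind representing activation quantization with thresholds can be sketched as follows: a quantized activation with N output levels is equivalent to counting how many of N-1 sorted thresholds the input has reached. The code below is a plain-Python illustration under the assumption that the comparison is `x >= t`; it does not call the FINN library, and the names and step size are hypothetical.

```python
import math


def multithreshold(x, thresholds, out_scale=1.0, out_bias=0.0):
    """Count how many thresholds x has reached, then scale and shift."""
    crossed = sum(1 for t in thresholds if x >= t)
    return out_scale * crossed + out_bias


# A 2-bit unsigned activation (levels 0..3) with step size 0.5 needs
# 2**2 - 1 = 3 thresholds, placed halfway between adjacent levels.
STEP = 0.5
THRESHOLDS = [0.25, 0.75, 1.25]


def quant_relu_2bit(x):
    """Reference: clamp to [0, 3] after rounding x / STEP to the nearest level."""
    return min(3, max(0, math.floor(x / STEP + 0.5)))


samples = [-1.0, 0.0, 0.2, 0.25, 0.6, 0.75, 1.0, 1.3, 5.0]
print([int(multithreshold(x, THRESHOLDS)) for x in samples])
print([quant_relu_2bit(x) for x in samples])
```

Both lines print the same integer levels, which is why a chain of scale, activation, and rounding operations can be collapsed into one threshold comparison per level in hardware.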

# FINN Passes - ONNX Graph Transformations






```
template<
  unsigned MW,      // Width of the input matrix
  unsigned MH,      // Height of the input matrix
  unsigned SIMD,    // Number of input columns computed in parallel
  unsigned PE,      // Number of output rows computed in parallel
  typename TI,      // Input Datatype
  typename TO,      // Output Datatype
  typename TW,      // Weight Datatype
  typename TA       // Activation Datatype
>
void Matrix_Vector_Activate_Batch(
  hls::stream<hls::vector<TI>> &in,
  hls::stream<hls::vector<TO>> &out,
  TW const &weights,
  TA const &activation
);
```

Corresponding to  
finn-hlslib function call  
or finn-rtllib module

Optimization, lowering, code generation... are all transformations

# FINN Hardware Folding
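
Folding trades resources against throughput by time-multiplexing a layer over fewer compute units. Using the MW, MH, SIMD and PE parameters of the matrix-vector template shown earlier, a simplified cycle estimate can be sketched as below. It assumes SIMD divides MW and PE divides MH, ignores pipeline fill and control overhead, and uses hypothetical function names.

```python
def mvau_cycles(mw, mh, simd, pe):
    """Clock cycles to process one input vector with the given folding."""
    if mw % simd or mh % pe:
        raise ValueError("SIMD must divide MW and PE must divide MH")
    return (mw // simd) * (mh // pe)


def vectors_per_second(cycles, clk_mhz):
    """Input vectors processed per second at a given clock frequency."""
    return clk_mhz * 1e6 / cycles


# Same 64x32 layer at three folding settings, from fully folded to fully unfolded.
for simd, pe in [(1, 1), (8, 4), (64, 32)]:
    cycles = mvau_cycles(mw=64, mh=32, simd=simd, pe=pe)
    print(simd, pe, cycles, vectors_per_second(cycles, clk_mhz=200))
```

With SIMD equal to MW and PE equal to MH the layer is fully unfolded and accepts a new input every clock cycle. In a multi-layer dataflow pipeline, overall throughput is bounded by the slowest layer, so folding is usually balanced across layers.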



# FINN HLS/RTL Library - Parameterizable Kernel Library

- Kernels representing individual layers, a.k.a. Operators
- Flexible parameterization of:
  - Degree of parallelism (output channels, input channels, kernel dimensions ...)
  - Datatypes (INT8, ternary, INT2, ...)
  - Behaviour (activation function)
- Instantiated and stitched by FINN compiler with AXI-Stream data path
- Implemented as synthesizable C++ (Vitis HLS) or SystemVerilog



# FINN Compiler: IP Generation Flow



# Deployment with PYNQ™ for Python Productivity

```
import numpy as np
from finn_examples import models

# instantiate the accelerator
accel = models.cnv_w2a2_cifar10()
# generate an empty numpy array to use as input
dummy_in = np.empty(accel.ishape_normal, dtype=np.uint8)
# perform inference and get output
dummy_out = accel.execute(dummy_in)
```



- Use PYNQ-provided Python abstractions and drivers
- User provides a NumPy array as input, calls the driver, and retrieves a NumPy array as output
  - Internally, the PYNQ DMA driver writes and reads NumPy arrays to and from the I/O streams

<https://github.com/Xilinx/PYNQ>

<https://github.com/Xilinx/finn-examples>

# FINN Dataflow Build Mode

FINN flow = Python script making calls to the FINN API

Consists of a sequence of steps. Each step:

- is a Python function with a standardized interface
- consumes and produces ONNX
- may produce other files
- may be standard or custom
- may have config-dependent behavior



Produce output files from  
input ONNX and config



Can be resumed from  
intermediate steps

ONNX files act as checkpoints
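
The step-based structure described above can be illustrated with a toy pipeline. This is not the FINN builder API: the model is a plain dictionary standing in for an ONNX graph, the in-memory `checkpoints` dictionary stands in for ONNX files on disk, and all step names and configuration keys are hypothetical.

```python
import copy


def step_streamline(model, cfg):
    """Toy step: drop operators that would be absorbed by streamlining."""
    model["ops"] = [op for op in model["ops"] if op != "BatchNorm"]
    return model


def step_convert_to_hw(model, cfg):
    """Toy step: mark every remaining operator as a hardware layer."""
    model["ops"] = [f"HW_{op}" for op in model["ops"]]
    return model


def step_set_target(model, cfg):
    """Toy step with config-dependent behaviour."""
    model["target_fps"] = cfg["target_fps"]
    return model


STEPS = [step_streamline, step_convert_to_hw, step_set_target]


def build(model, cfg, checkpoints, start_step=None):
    """Run all steps in order, optionally resuming from a saved checkpoint."""
    names = [s.__name__ for s in STEPS]
    first = names.index(start_step) if start_step else 0
    if first > 0:
        # Resume from the output of the step preceding start_step.
        model = copy.deepcopy(checkpoints[names[first - 1]])
    for step in STEPS[first:]:
        model = step(copy.deepcopy(model), cfg)
        checkpoints[step.__name__] = copy.deepcopy(model)
    return model


checkpoints = {}
model = {"ops": ["MatMul", "BatchNorm", "MultiThreshold"]}
full = build(model, {"target_fps": 1000}, checkpoints)
resumed = build(None, {"target_fps": 5000}, checkpoints, start_step="step_set_target")
print(full)
print(resumed)
```

Because every step consumes and produces the same kind of object, a changed configuration only requires re-running the steps after the last valid checkpoint.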

# **FINN Infrastructure and Workflow**



# The FINN Ecosystem and Software Stack



 **FINN** project landing page: <https://xilinx.github.io/finn>

- Quick Start, Documentation, Examples (Jupyter Notebooks)
- Links to Repos

# A FINN End-to-End Flow

Brevitas

Network Preparation

Vivado HLS and IPI



Brevitas FINN-ONNX Export

Streamlining Transformations

Convert to HW Layers

Specialize Layers

Adjust folding to maximize performance

Create IP per layer

Create stitched design

Network of HLS/RTL layers, stitched IP  
**Ready to be integrated in Vivado IPI**

- Trained Network in PyTorch/Brevitas

- Network of high-level ONNX layers

- Streamlined network of high-level ONNX layers

- Network of HW layers, maximum folding

- Network of HLS/RTL layers, maximum folding

- Network of HLS/RTL layers, desired folding

- Network of HLS/RTL layers, IP per layer

Traditional HW Design RTL Simulation

## Simulation and Emulation Flows

Simulation using Python

Prepare cppsim

Run cppsim (HLS C++)

Network of HLS layers with C++ wrappers

Prepare rtlsim (stitched)

Prepare rtlsim (layer by layer)

Full-network Verilator model

Network of HLS/RTL layers with Verilator models

Emulation (rtlsim) using PyVerilator

# FINN Workflow



FINN and Brevitas can be used as co-design tools to implement your DNN use case on an FPGA.

- Train a quantized neural network in PyTorch using Brevitas
- Convert the trained QNN to a Vivado IP
- Fine-tune model to meet resource/performance targets
- Integrate generated IP into a larger design

But you can leverage the infrastructure beyond that...




# Research in the FINN Ecosystem



# **Status and Outlook**



# Status Summary

- **Open-Source Adoption**
  - 2k+ GitHub stars in total across repos
  - 250k+ Brevitas downloads
  - ~200k QONNX downloads
  - 17k+ FINN compiler downloads
- **Academic Results**
  - ACM TRETS 2020, FPL'2020, DFT'2019 Best Paper awards
  - 1000+ citations on original paper
- **University Classes on computer architecture for ML with FINN**
  - Stanford, UNC Charlotte, NTNU in Norway, EPFL in Switzerland
  - Regular tutorials, also available on YouTube: <https://www.youtube.com/watch?v=zw2aG4PhzmA>
- **Business units providing customer support**
  - Lead engineering team: Custom and Strategic Engineering, Dublin

*"The FINN toolset is showing huge potential using it in upcoming SICK products.*

*It is easy to use and with an extraordinary performance and very promising results.*

*In the future, flexible implementations of ML in our products with FINN can be a great advantage and even replace static architectures as they are currently used.*

*Thanks to the FINN team for the great cooperation"*

*– Sick AG*

<https://github.com/Xilinx/brevitas>

<https://github.com/Xilinx/finn>

<https://github.com/Xilinx/finn-hlslib>

<https://github.com/Xilinx/finn-examples>

<https://github.com/fastmachinelearning/qonnx>

# FINN Layer Support

| Layer                         | Current Support              | Outlook                 |
|-------------------------------|------------------------------|-------------------------|
| **GEMM**                      | ✓                            |                         |
| **Conv1D and Conv2D**         | ✓                            |                         |
| - **Dense**                   | ✓                            |                         |
| - **Depthwise**               | ✓                            |                         |
| - **Separable (pointwise)**   | ✓                            |                         |
| **Elementwise (add, sub)**    | ✓                            | others easily doable    |
| **Activation**                | ReLU, SeLU                   |                         |
| **BatchNorm**                 | ✓ (absorbed by streamlining) |                         |
| **Pooling**                   | ✓                            |                         |
| **Scale**                     | ✓ (absorbed by streamlining) |                         |
| **Concat**                    | ✓                            |                         |
| **Reshape**                   | ✓ (must be streamlinable)    |                         |
| **Transpose**                 | ✓ (must be streamlinable)    |                         |
| **Clip by Value**             | ✓ (absorbed by streamlining) |                         |
| **TransposeConv2D**           | ✓                            | optimized version (WIP) |
| **UpSample**                  | ✓                            |                         |
| **DownSample**                | ✓                            |                         |

# Brevitas Updates

- Targets the **entire AMD product range**
- First-class support for **integer datatypes**
  - prototype support for minifloats (e.g., FP8)
- Supports **PTQ and QAT**
- Out-of-the-box support for distributed training (e.g., DDP, and interoperability with **HuggingFace Accelerate (PP)**)
- Interoperability with **HuggingFace Transformers**
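To make the integer-datatype bullet concrete, here is a minimal sketch of uniform integer quantization, the mapping that both PTQ (applied after training) and QAT (simulated during training) rely on. It is plain Python for illustration only and is not the Brevitas API; the function names `quantize` and `dequantize` and the scale value are assumptions.

```python
def quantize(x, scale, bits, signed=True):
    """Map a real value to an integer code: round(x / scale), clamped to range."""
    if signed:
        qmin, qmax = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        qmin, qmax = 0, (1 << bits) - 1
    q = round(x / scale)
    return max(qmin, min(qmax, q))


def dequantize(q, scale):
    """Map the integer code back to the real value it represents."""
    return q * scale


scale = 0.1  # hypothetical per-tensor scale factor
for x in (0.37, 1.0, -0.95):
    q = quantize(x, scale, bits=4)
    print(x, "->", q, "->", dequantize(q, scale))
```

With a signed 4-bit range of [-8, 7], the value 0.37 maps to code 4, while 1.0 and -0.95 saturate at 7 and -8. During QAT the network sees the dequantized values in the forward pass, so it can learn weights that tolerate this rounding and clamping.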



<https://github.com/Xilinx/brevitas>

# FINN Compiler Updates

FINN v0.10.1  
Release

- **Refactoring** of operator instantiation infrastructure
  - FINN compiler used to assume that hardware blocks are synthesized from HLS code
  - New class hierarchy to facilitate integration of RTL components
  - Provide users with an interface to override the compiler's choice for HLS vs. RTL implementation on a per-layer basis
- **RTL component** library optimizing the implementations of critical layers
  - Efficient implementation of 4-bit and 8-bit compute leveraging DSP slices
  - Efficient implementation of multi-level thresholding
  - Elimination of the (often long) HLS synthesis times for layers with an RTL option
- **Compiler optimization pass** for accumulator and weight bit width minimization
- **Added board support** in system integration flow
  - **RFSoC 4x2** and **U55C** (contributed by University of Paderborn)
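The idea behind accumulator bit width minimization can be sketched as follows: once the trained integer weights are known, the worst-case dot-product range can be computed from the actual weight values instead of from the datatype bounds, which often yields a narrower accumulator. This is an illustrative sketch under that assumption, not the FINN compiler pass itself; `accumulator_range` and `min_bits` are hypothetical helper names.

```python
def accumulator_range(weights, in_min, in_max):
    """Worst-case (lowest, highest) dot product for inputs in [in_min, in_max]."""
    lo = sum(w * (in_min if w >= 0 else in_max) for w in weights)
    hi = sum(w * (in_max if w >= 0 else in_min) for w in weights)
    return lo, hi


def min_bits(lo, hi):
    """Smallest integer width that can hold every value in [lo, hi]."""
    if lo >= 0:
        return max(1, hi.bit_length())  # unsigned is sufficient
    n = 1
    # Grow a two's-complement width until both ends of the range fit.
    while not (-(1 << (n - 1)) <= lo and hi <= (1 << (n - 1)) - 1):
        n += 1
    return n


# Inputs are unsigned 4-bit (0..15); weights are stored as signed 4-bit (-8..7).
trained = [3, -2, 1, -4]  # hypothetical trained weights of one output channel
print(min_bits(*accumulator_range(trained, 0, 15)))  # 8

# Datatype-only bound: every weight could be -8 (lowest) or 7 (highest).
naive_lo = len(trained) * (-8 * 15)
naive_hi = len(trained) * (7 * 15)
print(min_bits(naive_lo, naive_hi))  # 10
```

In this small example the data-dependent bound saves two accumulator bits per output channel compared with the datatype-only bound.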

# FINN Technical Roadmap: Capabilities

- Operator Hardening
  - **Revised RTL Thresholding by binary search**
    - Ingestion of fp32 inputs
- DSP-enabled Generalized Datatype Support
  - Efficient higher-precision integer compute: **int4**, **int8**, ..., **int16**
  - Small standard floating-point formats: float16, bfloat16
  - Custom MiniFloats: fp4 – fp8
- Internal clock pumping of DSP datapaths to increase their operational density
  - Target: a standard operating frequency of around 500 MHz
- New Operators
  - Optimized **transposed convolution**
  - Fallback float layers to mitigate streamlining limits
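Multi-level thresholding maps an input to the number of thresholds it meets or exceeds, which implements a quantized activation. When the thresholds are sorted, that count can be found by binary search instead of comparing against every threshold. The sketch below shows the equivalence in plain Python; it illustrates the concept only and is not the RTL implementation, and the threshold values are made up. Floating-point inputs work the same way as integers.

```python
from bisect import bisect_right


def threshold_linear(x, thresholds):
    """Reference: compare against every threshold (N comparisons)."""
    return sum(1 for t in thresholds if x >= t)


def threshold_binary_search(x, thresholds):
    """Same result on sorted thresholds with about log2(N) comparisons."""
    return bisect_right(thresholds, x)


thresholds = [-1.5, 0.0, 0.5, 2.25]  # hypothetical, sorted ascending

for x in (-2.0, 0.0, 0.5, 10.0):
    assert threshold_linear(x, thresholds) == threshold_binary_search(x, thresholds)
    print(x, "->", threshold_binary_search(x, thresholds))
```

For an n-bit activation there are 2^n - 1 thresholds, so the binary search needs about n comparison steps per input rather than 2^n - 1.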

# FINN Technical Roadmap: Ease of Use

- **FINN Library**
  - Refactoring of streamed layer interfaces
    - Packed flat `ap_uint<W>` → explicit `hls::vector<T, N>`
  - Combining HLS and RTL components into one FINN Library
- **FINN Examples**
  - MobileNet-v1 and VGG10-RadioML **with efficient DSP compute**
  - New example: German Traffic Sign Recognition Benchmark

FINN-examples v0.0.7  
Release
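To see why the move from a packed flat `ap_uint<W>` to an explicit `hls::vector<T, N>` helps readability, consider what a packed stream word requires: every consumer must know the element width, count, signedness, and ordering to recover the values. The Python sketch below mimics that packing and unpacking. The element order (element 0 in the least significant bits) and the helper names `pack` and `unpack` are assumptions made for illustration.

```python
def pack(elems, width):
    """Pack integers into one flat word, element 0 in the least significant bits."""
    mask = (1 << width) - 1
    word = 0
    for i, e in enumerate(elems):
        word |= (e & mask) << (i * width)  # two's-complement truncation to `width`
    return word


def unpack(word, width, count, signed=False):
    """Recover `count` elements of `width` bits each from a flat word."""
    mask = (1 << width) - 1
    out = []
    for i in range(count):
        v = (word >> (i * width)) & mask
        if signed and v >> (width - 1):
            v -= 1 << width  # sign-extend negative values
        out.append(v)
    return out


word = pack([1, -2, 3, -4], width=4)
print(hex(word))                                  # 0xc3e1
print(unpack(word, width=4, count=4, signed=True))  # [1, -2, 3, -4]
print(unpack(word, width=4, count=4))               # [1, 14, 3, 12]
```

The last line shows the hazard: reading the same word with the wrong signedness silently yields different values, which is the kind of mistake an explicitly typed vector interface rules out.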

# Resources

- <https://github.com/Xilinx/brevitas>
- <https://github.com/Xilinx/finn>
- <https://github.com/Xilinx/finn-hlslib>
- <https://github.com/Xilinx/finn-examples>
- <https://github.com/fastmachinelearning/qonnx>
- <https://amd.com/aup>

**Q & A**



# COPYRIGHT AND DISCLAIMER

©2024 Advanced Micro Devices, Inc. All rights reserved.

AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. Any computer system has risks of security vulnerabilities that cannot be completely prevented or mitigated. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

THIS INFORMATION IS PROVIDED 'AS IS.' AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
