



**EECS 151/251A**  
**Spring 2023**  
**Digital Design and Integrated**  
**Circuits**

Instructor:  
J. Wawrzynek

Lecture 5: FPGAs

# FPGAs are in widespread use



**FPGAs Power Net-Centric  
Battlefield on Many Fronts**

Far more different designs are implemented in FPGAs than in custom chips.

**Automotive Innovators  
Hit High Gear in  
Driver Assistance  
with FPGA Platforms**

**Plugging into  
High-Volume  
Consumer  
Products**

**HIGH VOLUME**  
Spartan-3E: A New Era  
Multimedia for Automotive  
DSP Algorithms

**DESIGN TOOLS**  
New ISE 7.1i Software  
Control Your Designs

**SERIAL I/O**  
Extend Your Reach

# ACCELERATING THE FUTURE WITH INTEL FPGAS

Field programmable gate arrays (FPGAs) are taking computing to new heights by offering engineers the ability to program digital logic in the field on a chip many times — from anywhere. Here's what a system-level integrated circuit like an FPGA can do for you.



# And in the Data-Center



Image credit: Copyright © 2018 Microsoft Corporation. All rights reserved.

# EECS151 FPGA Lab Board



# FPGA: Xilinx Virtex-5 XC5VLX110T



Virtex-5 enhanced die photo



A die is an unpackaged part

# FPGA Overview

- Basic structure: two-dimensional array of logic blocks and flip-flops with a means for the user to configure (program):
  1. the interconnection between the logic blocks,
  2. the function of each block.



*Simplified version of FPGA internal architecture*

# *Why are FPGAs Interesting?*

- Technical viewpoint:
  - For hardware/system-designers, like ASICs - only better: “Tape-out” new design every few minutes/hours.
  - “reconfigurability” or “reprogrammability” may offer other advantages over fixed logic?
    - In-field reprogramming? Dynamic reconfiguration? Self-modifying hardware, evolvable hardware?

*Of course, the higher flexibility comes at the expense of larger die area, slower circuits, and more energy per operation.*

# Why are FPGAs Interesting?

- Staggering logic capacity growth (10000x):

| Year Introduced | Device    | Logic Cells | “logic gate equivalents” |
|-----------------|-----------|-------------|--------------------------|
| 1985            | XC2064    | 128         | 1024                     |
| 2011            | XC7V2000T | 1,954,560   | 15,636,480               |



- FPGAs have tracked Moore's Law better than any other programmable device.



# Why are FPGAs Interesting?

- Logic capacity now only part of the story: on-chip RAM, high-speed I/Os, “hard” function blocks, ...
- Modern FPGAs are “reconfigurable systems on a chip”

Xilinx Virtex-5 LX110T



Xilinx ZYNQ - embedded ARM cores



# Energy Efficiency of CPU versus ASIC versus FPGA

Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. Understanding sources of inefficiency in general-purpose chips. SIGARCH Comput. Archit. News, 38:37–47, June 2010.



Ian Kuon and Jonathan Rose. Measuring the gap between fpgas and asics. In Proceedings of the 2006 ACM/SIGDA 14th international symposium on Field programmable gate arrays, FPGA '06, pages 21–30, New York, NY, USA, 2006. ACM



$$\therefore \text{FPGA} : \text{CPU} = 70x$$

*Similar story for performance efficiency*



# FPGA Internals

# *Background for upcoming technical details*

Review: mux or multiplexor is a combinational logic circuit that chooses between  $2^N$  inputs under the control of N control signals.



A latch is a 1-bit memory (similar to a but “level sensitive” not edge-triggered, closer to an SRAM storage cell).



# FPGA Programmability

- FPGA programmability allows users to:
  - 1.define function of configurable logic blocks (CLBs),
  - 2.establish interconnection paths between CLBs
  - 3.set other options, such as clock, reset connections, and I/O.
- Most FPGAs have “**SRAM based**” programmability.



## Programmable Cross-points

- Latch-based (Xilinx, Intel/Altera, ...)



- + reconfigurable
- volatile
- relatively large.

# User Programmability

- ❑ Latches store the configuration.
- ❑ Configuration *bitstream* is loaded under user control.
- ❑ “partial reconfiguration”: a selective part of the array can be reprogrammed without disturbing the other parts.
- ❑ Dynamic / runtime reconfiguration: reprogramming during a computation.
- ❑ **Most commonly the entire device is programmed when the system is booted.**



# Simplified FPGA Logic Block



- ❑ Look up table (LUT)
  - implements any combinational logic function
- ❑ Register (Flip-flop)
  - optionally stores output of LUT

# 4-LUT Implementation



- ❑ LUTs size named by number of inputs
- ❑ n-bit LUT is implemented as a  $2^n \times 1$  memory:
  - inputs choose one of  $2^n$  memory locations.
  - memory locations (latches) are loaded with values from user's configuration bit stream.
  - Inputs to mux control are the LUT inputs.
- ❑ Result is a general purpose “logic gate”.
  - n-LUT can implement any function of n inputs!

# LUT as general logic gate

- An n-LUT is a direct implementation of a function truth-table.
- Each latch location holds the value of the function corresponding to one input combination.

*Example: 2-input functions*

| INPUTS | AND | OR |   |   |   |
|--------|-----|----|---|---|---|
| 00     | 0   | 0  |   |   |   |
| 01     | 0   | 1  | • | • | • |
| 10     | 0   | 1  |   |   |   |
| 11     | 1   | 1  |   |   |   |

*A 2-lut Implements any function of 2 inputs.*

*How many of these are there?*

*How many functions of n inputs?*

*Example: 4-lut*

| INPUTS |                                 |
|--------|---------------------------------|
| 0000   | F(0,0,0,0) ← store in 1st latch |
| 0001   | F(0,0,0,1) ← store in 2nd latch |
| 0010   | F(0,0,1,0) ←                    |
| 0011   | F(0,0,1,1) ←                    |
| 0011   |                                 |
| 0100   | •                               |
| 0101   | •                               |
| 0110   | •                               |
| 0111   |                                 |
| 1000   |                                 |
| 1001   |                                 |
| 1010   |                                 |
| 1011   |                                 |
| 1100   |                                 |
| 1101   |                                 |
| 1110   |                                 |
| 1111   |                                 |

# FPGA Generic Design Flow

- ▶ Design Entry:
  - ▶ HDL (hardware description languages: Verilog, VHDL)
- ▶ Design Implementation:
  - ▶ Logic synthesis (in case of using HDL entry) followed by,
  - ▶ Partition, place, and route to create configuration bit-stream file
- ▶ Design verification:
  - ▶ Optionally use simulator to check function
  - ▶ Load design onto FPGA device (cable connects PC to development board), optional “logic scope” on FPGA
  - ▶ check operation at full speed in real environment.



<https://digitalsystemdesign.in/fpga-implementation-step-by-step/>

# Example Partition, Placement, and Route

- Simplified FPGA structure:



- Example Circuit:

- collection of gates and flip-flops



*Circuit combinational logic must be “covered” by 4-input 1-output LUTs.*

*Flip-flops from circuit must map to FPGA flip-flops.*

*(Best to preserve “closeness” to CL to minimize wiring.)*

*Best placement in general attempts to minimize wiring.*

*Vdd, GND, clock, and global resets are all “prewired”.*

# Example Partition, Placement, and Route



## Example Circuit:

- collection of gates and flip-flops



Two partitions. Each has single output, no more than 4 inputs, and no more than 1 flip-flop. In this case, inverter goes in both partitions.

Note: (with 4-LUTs) the partition can be arbitrarily large as long as it has not more than 4 inputs and 1 output, and no more than 1 flip-flop.

# Configurable Interconnect



## ▶ Design Challenges (topology):

- ▶ traversing long wires incurs delay and energy
- ▶ switches (transistors) add significant delay
- ▶ Mapping time

switch matrix could be more richly populated

“connection block”

# Xilinx FPGAs (interconnect detail)



# Embedded Hard Blocks

- ▶ Many important functions are not efficient when implemented in the reconfigurable fabric:
  - ▶ multiplication, large memory, ...
- ▶ Dedicated blocks take relatively little area and therefore could go unused.



Colors represent different types of resources:

Logic

Block RAM

DSP (ALUs)

Clocking

I/O

Serial I/O + PCI

A routing fabric runs throughout the chip to wire everything together.



# State-of-the-Art - Xilinx FPGAs



*Virtex Ultra-scale*

| Device Name                                 | VU3P      | VU5P      | VU7P      | VU9P      | VU11P     | VU13P     | VU27P     | VU29P     | VU31P     | VU33P     | VU35P     | VU37P     |
|---------------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| System Logic Cells (K)                      | 862       | 1,314     | 1,724     | 2,586     | 2,835     | 3,780     | 2,835     | 3,780     | 962       | 962       | 1,907     | 2,852     |
| CLB Flip-Flops (K)                          | 788       | 1,201     | 1,576     | 2,364     | 2,592     | 3,456     | 2,592     | 3,456     | 879       | 879       | 1,743     | 2,607     |
| CLB LUTs (K)                                | 394       | 601       | 788       | 1,182     | 1,296     | 1,728     | 1,296     | 1,728     | 440       | 440       | 872       | 1,304     |
| Max. Dist. RAM (Mb)                         | 12.0      | 18.3      | 24.1      | 36.1      | 36.2      | 48.3      | 36.2      | 48.3      | 12.5      | 12.5      | 24.6      | 36.7      |
| Total Block RAM (Mb)                        | 25.3      | 36.0      | 50.6      | 75.9      | 70.9      | 94.5      | 70.9      | 94.5      | 23.6      | 23.6      | 47.3      | 70.9      |
| UltraRAM (Mb)                               | 90.0      | 132.2     | 180.0     | 270.0     | 270.0     | 360.0     | 270.0     | 360.0     | 90.0      | 90.0      | 180.0     | 270.0     |
| HBM DRAM (GB)                               | -         | -         | -         | -         | -         | -         | -         | -         | 4         | 8         | 8         | 8         |
| HBM AXI Interfaces                          | -         | -         | -         | -         | -         | -         | -         | -         | 32        | 32        | 32        | 32        |
| Clock Mgmt Tiles (CMTs)                     | 10        | 20        | 20        | 30        | 12        | 16        | 16        | 16        | 4         | 4         | 8         | 12        |
| DSP Slices                                  | 2,280     | 3,474     | 4,560     | 6,840     | 9,216     | 12,288    | 9,216     | 12,288    | 2,880     | 2,880     | 5,952     | 9,024     |
| Peak INT8 DSP (TOP/s)                       | 7.1       | 10.8      | 14.2      | 21.3      | 28.7      | 38.3      | 28.7      | 38.3      | 8.9       | 8.9       | 18.6      | 28.1      |
| PCIe® Gen3 x16                              | 2         | 4         | 4         | 6         | 3         | 4         | 1         | 1         | 0         | 0         | 1         | 2         |
| PCIe Gen3 x16/Gen4 x8 / CCIX <sup>(1)</sup> | -         | -         | -         | -         | -         | -         | -         | -         | 4         | 4         | 4         | 4         |
| 150G Interlaken                             | 3         | 4         | 6         | 9         | 6         | 8         | 6         | 8         | 0         | 0         | 2         | 4         |
| 100G Ethernet w/ KR4 RS-FEC                 | 3         | 4         | 6         | 9         | 9         | 12        | 11        | 15        | 2         | 2         | 5         | 8         |
| Max. Single-Ended HP I/Os                   | 520       | 832       | 832       | 832       | 624       | 832       | 520       | 676       | 208       | 208       | 416       | 624       |
| GTY 32.75Gb/s Transceivers                  | 40        | 80        | 80        | 120       | 96        | 128       | 32        | 32        | 32        | 32        | 64        | 96        |
| GTM 58Gb/s PAM4 Transceivers                |           |           |           |           |           |           | 32        | 48        |           |           |           |           |
| 100G / 50G KP4 FEC                          |           |           |           |           |           |           | 16 / 32   | 24 / 48   |           |           |           |           |
| Extended <sup>(2)</sup>                     | -1-2-2L-3 |
| Industrial                                  | -1-2      | -1-2      | -1-2      | -1-2      | -1-2      | -1-2      | -1-2      | -1-2      | -         | -         | -         | -         |

# Atoms: 5-input Look Up Tables (LUTs)



| A[6:2] |   | D |
|--------|---|---|
| 00000  | 1 |   |
| 00001  | 0 |   |
| 00010  | 1 |   |
| .      |   |   |
| 11101  | 0 |   |
| 11110  | 0 |   |
| 11111  | 1 |   |



Computes any 5-input logic function.

Timing is independent of function.

Latches set during configuration.

# Virtex 6-LUTs: Composition of 5-LUTs

May be used  
as one  
6-input LUT  
(D6 out) ...

---

... or as two  
5-input LUTS  
(D6 and D5)

---



Figure 3: Block Diagram of a Virtex-5 6-Input LUT

Combinational  
logic  
(post configuration)

# Configurable Logic Blocks (CLBs)

*Slices define regular connections to the switching fabric, and to slices in CLBs above and below it on the die.*



# The simplest view of a slice



Four 6-LUTs

Four Flip-Flops

*Switching fabric may see  
combinational and registered  
outputs.*

An actual Virtex slice adds many small features to this simplified diagram. We show them one by one ...

# Two 7-LUTs per slice ...



Extra  
multiplexers(F7AMUX,  
F7BMUX)  
Extra inputs  
(AX and CX)

*Or one 8-LUTs per slice ...*



Third  
multiplexer(F8MUX)

---

Third input  
(BX)

# Extra muxes to chose LUT option ...



*From eight 5-LUTs  
... to one 8-LUT.*

---

*Combinational  
or registered outs.*

---

*Flip-flops unused by  
LUTs can be used  
standalone.*

# Virtex Vertical Logic



We can map ripple-carry addition onto carry-chain block.



The carry-chain block also useful for speeding up other adder structures and counters.

## *Putting it all together ... a SLICEL.*



The previous slides explain all SLICEL features.

---

About 50% of the are SLICELs.

---

The other slices are SLICEMs, and have extra features.

# Recall: 5-LUT architecture ...



| A[6:2] | D |
|--------|---|
| 00000  | 1 |
| 00001  | 0 |
| 00010  | 1 |
| •      |   |
| 11101  | 0 |
| 11110  | 0 |
| 11111  | 1 |



- ▼ 32 Latches.
- ▼ Configured to 1 or 0.

Some parts of a logic design need many state elements.

SLICEMs replace normal 5-LUTs with circuits that can act like 5-LUTs, but can alternatively use the 32 latches as RAM, ROM, shift registers.

# Virtex DSP48E Slice



\*These signals are dedicated routing paths internal to the DSP48E column. They are not accessible via fabric routing resources.

UG193\_c1\_01\_032806

Efficient implementation of multiply, add, bit-wise logical.

To be continued ...

Throughout the semester, we will look at different FPGA features in-depth.

Switch fabric

Block RAM

DSP48 (ALUs)

Clocking

I/O

Serial I/O + PCI

