

# L23: Advanced FPGAs

18-240: Structure and Design of Digital Systems

Tom Zajdel & Bill Nace  
Fall 2023

© 2004 - 2023 All Rights Reserved. All work contained herein is copyrighted and used by permission of the authors. Contact ece240-staff@lists.andrew.cmu.edu for permission or for more information.



# 18-240: Where are we...?

Carnegie Mellon

- 1 Handout: Lecture Notes
- Final Exemptions on Friday

| Week | Date  | Lecture                                | Reading | Lab                 | HW    |
|------|-------|----------------------------------------|---------|---------------------|-------|
| 11   | 11/14 | Midterm 2                              |         | No Lab              | HW 8  |
|      | 11/16 | L18 Running Assembly on the Datapath 1 |         |                     |       |
|      | 11/17 | GA Small Group Meetings                |         |                     |       |
| 12   | 11/21 | L19 Running Assembly on the Datapath 2 |         | No Lab              | No HW |
|      | 11/23 | Thanksgiving; No class                 |         |                     |       |
|      | 11/24 | Thanksgiving; No small groups          |         |                     |       |
| 13   | 11/28 | L20 Memory and Control Flow            |         | Lab 5               | HW 9  |
|      | 11/30 | L21 Timing and Optimizations           |         |                     |       |
|      | 12/1  | GB Small Group Meetings                |         |                     |       |
| 14   | 12/5  | L22 Extending the Design               |         |                     | HW A (due 12/8) |
|      | 12/7  | L23 Advanced FPGAs                     |         |                     |       |
|      | 12/8  | GC Small Group Meetings                |         |                     |       |
|      | 12/17 | Final Exam                             |         |                     |       |

# Final Exam


- **17 December, 8:30-11:10pm**
  - In DH 2210
- **Comprehensive**
  - But, weighted towards Comp Architecture material
- **3 Note Sheets**
  - 8.5"x11" or A4, both sides
  - Handwritten by you
- **Review Session**
  - Tentatively Wednesday; watch Piazza for an announcement

# Today: Advanced FPGAs


- History of FPGAs
- Close look at Xilinx Spartan 3
- “Modern” FPGAs
- Note: Xilinx examples are used, but other companies also provide exceptional FPGAs
  - Especially Altera, Inc
  - Many others...

| Vendor         | 2015 FPGA Total (\$M) | 2015 Market share | 2016 FPGA Total (\$M) | 2016 Market share | Growth CY15-CY16 |
|----------------|-----------------------|-------------------|-----------------------|-------------------|------------------|
| Xilinx         | \$2,044        | 53%          | \$2,167        | 53%          | 6%               |
| Intel (Altera) | \$1,389        | 36%          | \$1,486        | 36%          | 7%               |
| Microsemi      | \$301          | 8%           | \$297          | 7%           | -1%              |
| Lattice        | \$124          | 3%           | \$144          | 3%           | 16%              |
| QuickLogic     | \$19           | 0%           | \$11           | 0%           | -40%             |
| Others         | \$2            | 0%           | \$2            | 0%           | 0%               |
| <b>TOTAL</b>   | <b>\$3,879</b> | <b>100%</b>  | <b>\$4,112</b> | <b>100%</b>  | <b>6%</b>        |



Achronix Semiconductor Corporation

Altera now owned by Intel



Xilinx now owned by AMD

# Aside: Microprocessor vs FPGA


- Both FPGA and uP provide **generic functionality**
  - uP can execute different **programs**
    - ◆ Specified in Assembly Language
    - ◆ Complex designs quite attainable by “easily” trained engineers
    - ◆ Embedded systems: Computers that do a specific task
  - FPGA can be configured to act as different **hardware**
    - ◆ Specified in HDL (SystemVerilog, VHDL, etc)
    - ◆ Complex designs potentially attainable by “highly” trained engineers
- uP relies upon Fetch / Decode / Execute cycle
  - Therefore can only execute a single instruction at a time
    - ◆ modulo Instruction-Level Parallelism, multi-core processors, etc
- FPGA uses Hardware Threads (FSM-D)
  - Therefore can execute lots of stuff at a time
  - Potentially much higher performance

# Not so long ago


- Digital logic implemented with chips
  - SSI, MSI, LSI, ...
  - Small package complexity meant designs with lots of chips
  - Designs limited by inter-chip routing
  - ◆ 18-240 taught with protoboards, wire-wrap, soldering



# Not so long ago & somewhat today


- Big designs implemented as ASICs
  - ASIC = Application Specific Integrated Circuit
    - ◆ Chip built for a special purpose
- 3 Design approaches
  - Full custom design
    - ◆ Designer controls every transistor, route, I/O
    - ◆ Expensive: thus only used for high-volume/high-end designs
      - CPUs, commodity devices, sometimes special purpose
  - Standard cell design
    - ◆ Low level functionality via library of silicon functions
    - ◆ Performs slower, but is cheaper, than full-custom
  - Gate Array
    - ◆ Wafer manufactured with sea of primitives (inverters, NAND)
    - ◆ Designer customizes upper metal layers to connect primitives

# Economic Argument for FPGAs


$$\text{Total Cost} = (\text{cost/unit}) \times (\#\text{ units}) + \text{NRE costs}$$

- FPGAs: High package cost (\$1000+), low NRE costs
- ASICs: Low package cost, high NRE costs (\$50M+)<sup>1</sup>
  - ASIC NRE  $\approx f(\# \text{ transistors})$

(Courtesy Xilinx, Inc.)



<sup>1</sup>Design cost is ~\$65M for 28-40 nm designs (2012 Data)
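
The total-cost formula above implies a break-even volume: below it the FPGA is cheaper, above it the ASIC wins. The sketch below works this through in Python. The unit costs and NRE figures are hypothetical values picked only to match the orders of magnitude on this slide, and the function names are illustrative.

```python
import math


def total_cost(unit_cost, units, nre):
    """Total Cost = (cost/unit) * (# units) + NRE costs."""
    return unit_cost * units + nre


def break_even_units(fpga_unit, fpga_nre, asic_unit, asic_nre):
    """Smallest volume at which the ASIC total cost drops below the FPGA's.

    Assumes the FPGA has the higher unit cost and the ASIC the higher NRE.
    """
    return math.ceil((asic_nre - fpga_nre) / (fpga_unit - asic_unit))


# Hypothetical numbers, for illustration only.
FPGA_UNIT, FPGA_NRE = 1_000, 100_000      # expensive part, small NRE
ASIC_UNIT, ASIC_NRE = 20, 50_000_000      # cheap part, $50M NRE

crossover = break_even_units(FPGA_UNIT, FPGA_NRE, ASIC_UNIT, ASIC_NRE)
print("ASIC becomes cheaper at", crossover, "units")
```

With different assumed costs the crossover moves a lot, which is why the economic argument depends so strongly on expected volume.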

# Early FPGA trends


- More than 20x bigger per decade
- More than 5x faster
- More than 50x cheaper / more power efficient



(Courtesy Xilinx, Inc.)

# Enter the FPGA


- Xilinx XC3020 (~1990)
- 1st truly usable FPGA family
- Logic block: 2 LUTs, 2 FFs
- 2 FFs per I/O
- I/O: 5V TTL
- SRAM based
- Design entry via schematics or primitive HDL like ABEL
  - HDLs created for smaller primitive devices like PALs



(Courtesy Xilinx, Inc.)

# XC3020




# Logic Element (LE)


- Also called a Configurable Logic Block (CLB)
- Includes a Look-up Table (LUT) and F/F(s)
  - Signal routing within LE done with multiplexers
    - ◆ In picture below, S determines if output comes from F/F or from LUT
  - Contents of LUT and multiplexer-based routing are controlled by configuration bits (static memory, fuses, etc)



- Can be configured for any combinational or sequential circuit (of small enough size)
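
A small behavioral model may help make the logic element concrete. This is only a sketch: the class name, the input-to-index ordering, and the polarity of S (1 selects the F/F) are assumptions, not details from the figure.

```python
class LogicElement:
    """Behavioral sketch of one LE: a LUT, one F/F, and an output mux."""

    def __init__(self, lut_bits, s):
        self.lut_bits = list(lut_bits)  # configuration bits: the truth table
        self.s = s                      # configuration bit: 1 = F/F output, 0 = LUT output (assumed polarity)
        self.ff = 0                     # flip-flop state, assumed to reset to 0

    def lut(self, inputs):
        # The inputs form an address into the truth table (inputs[0] is the LSB).
        index = sum(bit << i for i, bit in enumerate(inputs))
        return self.lut_bits[index]

    def output(self, inputs):
        # Output mux controlled by S.
        return self.ff if self.s else self.lut(inputs)

    def clock(self, inputs):
        # On a clock edge the F/F captures the LUT output.
        self.ff = self.lut(inputs)


XOR2 = [0, 1, 1, 0]                         # truth table for a 2-input XOR

comb = LogicElement(XOR2, s=0)              # purely combinational use
print(comb.output((1, 0)))

reg = LogicElement(XOR2, s=1)               # registered use
reg.clock((1, 0))
print(reg.output((0, 0)))                   # holds the value captured at the clock edge
```

Changing only `lut_bits` and `s` turns the same hardware into a different circuit, which is the role the configuration bits play.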

# Logic Elements


- LEs get more and more stuff crammed into them over time
  - XC3K family had LUT, 2 FFs, clock enable, FF reset and 9 muxes
    - ◆ LUT had 5 direct inputs, 2 FF values as inputs, 2 outputs
    - ◆ ~51 bits of configuration SRAM per CLB



(Courtesy Xilinx, Inc.)

# Spartan-2 CLB


- Spartan-2 has 2 LUTs (4 input each) feeding a 3rd LUT, 2 FFs (with Preset/Reset, Enable, posedge or negedge clocks) and 16 muxes
  - 12 inputs (plus clock), 4 outputs, ~70 bits of configuration SRAM



(Courtesy Xilinx, Inc.)

# Spartan-3


- CLBs are composed of 4 *slices*
  - Organized as 2 pairs, one of which is optimized for memory access
- Each slice has 2 FFs and 2 LUTs
- The Spartan-3 used in 18-240 until S12 has 480 CLBs
  - ~200K Gate equivalent



(Courtesy Xilinx, Inc.)

# Spartan-7


- Each CLB has 2 slices
- Slice types vary, but half of them look like this
  - 4 LUTs (6 input)
  - 8 F/Fs
  - Arithmetic Carry Chain
- Top-of-the-line Spartan-7 has 8000 CLBs
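
The per-CLB figures on the Spartan-3 and Spartan-7 slides can be multiplied out to get device totals. The sketch below does that arithmetic. It assumes every slice in a device has the LUT and F/F counts quoted on the slide, which glosses over the note that slice types vary. The function name is illustrative.

```python
def family_totals(clbs, slices_per_clb, luts_per_slice, ffs_per_slice):
    """Multiply out CLB -> slice -> LUT/FF counts for one device."""
    slices = clbs * slices_per_clb
    return {
        "slices": slices,
        "luts": slices * luts_per_slice,
        "ffs": slices * ffs_per_slice,
    }


# Spartan-3 (as used in 18-240): 480 CLBs, 4 slices each, 2 LUTs + 2 FFs per slice
spartan3 = family_totals(480, 4, 2, 2)

# Top-of-the-line Spartan-7: 8000 CLBs, 2 slices each, 4 LUTs + 8 F/Fs per slice
spartan7 = family_totals(8000, 2, 4, 8)

print(spartan3)
print(spartan7)
```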



# Routing




(Courtesy Xilinx, Inc.)

# Detailed routing


- Spartan-2

Each tiny box is a pass-transistor which can be programmed for connection (or not)



# Routing at the Switch Matrix




Each matrix has 5 connections per side

(Courtesy Xilinx, Inc.)

# Memory Volatility


- Remember this picture from Memory lecture?
  - FPGA configuration memory must present all of its bits at all times
  - Unlike DRAM, which uses a protocol to get to a single line at a time
- Vendors use different memory types
  - Altera & Xilinx like SRAM (fast, easy to program)
  - Others use fuse or anti-fuse technology
- But, SRAM is volatile!
  - Yes, so it needs to be configured after each power-up
  - If you don't have a PC connected...
    - ◆ ... you can use a serial PROM to hold the bitstream
    - ◆ ... or CompactFlash / SD Card
    - ◆ ... or embedded processor
- Fuse (or anti-fuse) is non-volatile
  - But hard to erase



# FPGA Families extend Architecture


- Devices are built, with more capability, but around the same basic architecture

| Device                       | Max Logic Gates | Typical Gate Range | CLBs | Array   | User I/Os Max | Flip-Flops | Horizontal Longlines | Configuration Data Bits |
|------------------------------|-----------------|--------------------|------|---------|---------------|------------|----------------------|-------------------------|
| XC3020A, 3020L, 3120A        | 1,500           | 1,000 - 1,500      | 64   | 8 x 8   | 64            | 256        | 16                   | 14,779                  |
| XC3030A, 3030L, 3130A        | 2,000           | 1,500 - 2,000      | 100  | 10 x 10 | 80            | 360        | 20                   | 22,176                  |
| XC3042A, 3042L, 3142A, 3142L | 3,000           | 2,000 - 3,000      | 144  | 12 x 12 | 96            | 480        | 24                   | 30,784                  |
| XC3064A, 3064L, 3164A        | 4,500           | 3,500 - 4,500      | 224  | 16 x 14 | 120           | 688        | 32                   | 46,064                  |
| XC3090A, 3090L, 3190A, 3190L | 6,000           | 5,000 - 6,000      | 320  | 16 x 20 | 144           | 928        | 40                   | 64,160                  |
| XC3195A                      | 7,500           | 6,500 - 7,500      | 484  | 22 x 22 | 176           | 1,320      | 44                   | 94,984                  |

- Some additional capabilities
  - Low voltage versions
  - Faster clock rates
  - Different packaging options
- But same CLB design...
- ... same Routing mechanisms ...
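
Because the CLB and I/O designs are shared across the family, the table's totals follow from two numbers per device. Assuming 2 FFs per CLB and 2 FFs per I/O (the XC3020 figures quoted earlier in the lecture), the sketch below checks the CLB and flip-flop columns. The tuple layout is just a convenient transcription of the table.

```python
# (device, CLBs, (rows, cols), max user I/Os, flip-flops), copied from the table
XC3000_FAMILY = [
    ("XC3020A", 64, (8, 8), 64, 256),
    ("XC3030A", 100, (10, 10), 80, 360),
    ("XC3042A", 144, (12, 12), 96, 480),
    ("XC3064A", 224, (16, 14), 120, 688),
    ("XC3090A", 320, (16, 20), 144, 928),
    ("XC3195A", 484, (22, 22), 176, 1320),
]

FFS_PER_CLB = 2   # "Logic block: 2 LUTs, 2 FFs"
FFS_PER_IO = 2    # "2 FFs per I/O"

for device, clbs, (rows, cols), ios, ffs in XC3000_FAMILY:
    array_ok = clbs == rows * cols
    ff_ok = ffs == FFS_PER_CLB * clbs + FFS_PER_IO * ios
    print(device, "array matches:", array_ok, "flip-flops match:", ff_ok)
```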



(Courtesy Xilinx, Inc.)

# The need for more stuff


- CompEs cannot design on logic, routing, I/O alone
- Extreme case from early 90s
  - 16 port ATM switch, designed on a single board



- Design is limited by I/O to memory chips--bring them on-chip

# Other uses for LUTs


- 4 input LUT holds 16 bits for configuration
  - Here shown configured as a 4 input AND gate



- Create a shift register, just by changing the wires



- The LUT's 16 storage bits are also available as 16 bits of (distributed) RAM
- All 3 options (LUT4, SRL16, RAM16) are offered by the FPGA vendor -- you select by using the tools
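
The three options differ only in how the same 16 storage bits are written and read. The Python sketch below models that idea. The class and method names are made up for illustration, and the bit ordering (address 0 is the newest shifted-in bit) is an assumption rather than a vendor specification.

```python
class Lut16:
    """16 storage bits that can act as a LUT4, a shift register, or a RAM."""

    def __init__(self, init=0):
        # Bit i of `init` is the stored value at address i.
        self.bits = [(init >> i) & 1 for i in range(16)]

    def read(self, addr):
        # LUT4 and RAM16: the 4 inputs form an address that selects one bit.
        return self.bits[addr]

    def write(self, addr, value):
        # RAM16: overwrite one stored bit at run time.
        self.bits[addr] = value & 1

    def shift_in(self, value):
        # SRL16: every bit moves one position; the oldest bit falls off the end.
        self.bits = [value & 1] + self.bits[:-1]


# LUT4 configured as a 4-input AND: only address 0b1111 holds a 1.
and4 = Lut16(init=1 << 15)
print([and4.read(a) for a in range(16)])

# SRL16: shift in a 1 followed by three 0s; the 1 is now at tap address 3.
srl = Lut16()
for v in (1, 0, 0, 0):
    srl.shift_in(v)
print(srl.read(3))

# RAM16: write one bit, read it back.
ram = Lut16()
ram.write(5, 1)
print(ram.read(5))
```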

# Other “Stuff”


- **Clock managers**
  - Global clock buffering, distribution
  - DCM: Digital Clock Manager
    - ◆ Eliminate skew
    - ◆ Phase shift, multiply or divide a clock
    - ◆ Condition a clock: get clean 50% duty cycle output



# Other “Stuff”


- **Memory**
  - Block RAM (Altera DE2-115 has 3,888 Kbits)
- **Dedicated Multiplexers**
- **Carry Look-Ahead Generators**
- **I/O Blocks**
  - SelectIO supports 18 standards (single, differential, various voltage levels, ....)
  - DE2-115 has 280 User I/O blocks
- **Embedded Multipliers**
  - 18x18-bit signed or unsigned multiply
  - Excellent for DSP applications
- **Processor Core**
  - RISC-V, Nios, MicroBlaze, PicoBlaze (soft cores)
  - ARM, PowerPC hard cores

# IP: Intellectual Property


- **Someone else's design for you to put on FPGA**
  - Ex: USB circuitry, Ethernet, PCI Express, DSP algorithms
- **Protected against reverse engineering by encrypting the configuration bit stream**
  - FPGA has the key for decrypting on board
- **Very robust marketplace**
  - Xilinx tools cover ~700 IP cores, from dozens of vendors

# Where are we now?


- Virtex-7 is littered with “other stuff”
- CPU cores
- Specialized I/O
  - Ethernet
- Memory
  - Block RAM/FIFO
  - Distributed RAM
- Multiply-Accumulate
- Clock management



- Tool challenge: very hard for synthesis tools to make the right decisions

(Courtesy Xilinx, Inc.)

# Current chips have mucho resources


## One small section of the UltraScale+ Product Selection Guide

| Device Name                                 | VU9P  | VU11P | VU13P  | VU19P |
|---------------------------------------------|-------|-------|--------|-------|
| System Logic Cells (K)                      | 2,586 | 2,835 | 3,780  | 8,938 |
| CLB Flip-Flops (K)                          | 2,364 | 2,592 | 3,456  | 8,172 |
| CLB LUTs (K)                                | 1,182 | 1,296 | 1,728  | 4,086 |
| Max. Dist. RAM (Mb)                         | 36.1  | 36.2  | 48.3   | 58.4  |
| Total Block RAM (Mb)                        | 75.9  | 70.9  | 94.5   | 75.9  |
| UltraRAM (Mb)                               | 270.0 | 270.0 | 360.0  | 90.0  |
| DSP Slices                                  | 6,840 | 9,216 | 12,288 | 3,840 |
| Peak INT8 DSP (TOP/s)                       | 21.3  | 28.7  | 38.3   | 10.4  |
| PCIe® Gen3 x16                              | 6     | 3     | 4      | 0     |
| PCIe Gen3 x16/Gen4 x8 / CCIX <sup>(1)</sup> | –     | –     | –      | 8     |
| 150G Interlaken                             | 9     | 6     | 8      | 0     |
| 100G Ethernet w/ KR4 RS-FEC                 | 9     | 9     | 12     | 0     |
| Max. Single-Ended HP I/Os                   | 832   | 624   | 832    | 1,976 |
| Max. Single-Ended HD I/Os                   | 0     | 0     | 0      | 96    |
| GTY 32.75Gb/s Transceivers                  | 120   | 96    | 128    | 80    |

Over 4M LUTs

Over 8M FFs

UltraScale+ slice is 8 LUTs, 16 FFs

Recall: XC3K had 64 CLBs (with 2 LUTs, 2 FFs each)

Each LUT is 6-input, single output

(Courtesy Xilinx, Inc.)
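
Two quick sanity checks can be run on the selection guide. First, a slice with 8 LUTs and 16 FFs implies twice as many flip-flops as LUTs in every device. Second, the largest part can be compared against the 64-CLB XC3K device. The sketch below does both. The dictionary is a transcription of two table rows, and the comparison counts LUTs only, ignoring that a 6-input LUT holds more logic than the older ones.

```python
# device: (CLB LUTs in thousands, CLB flip-flops in thousands), from the table
ULTRASCALE_PLUS = {
    "VU9P": (1_182, 2_364),
    "VU11P": (1_296, 2_592),
    "VU13P": (1_728, 3_456),
    "VU19P": (4_086, 8_172),
}

LUTS_PER_SLICE, FFS_PER_SLICE = 8, 16

for device, (luts_k, ffs_k) in ULTRASCALE_PLUS.items():
    ratio_ok = ffs_k * LUTS_PER_SLICE == luts_k * FFS_PER_SLICE
    print(device, "FF:LUT ratio matches the slice:", ratio_ok)

# XC3K device recalled on the slide: 64 CLBs with 2 LUTs each
xc3k_luts = 64 * 2
vu19p_luts = ULTRASCALE_PLUS["VU19P"][0] * 1_000
growth = vu19p_luts // xc3k_luts
print("VU19P has roughly", growth, "times as many LUTs")
```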

# Costs


- **FPGA prices are non-linear per unit of logic**
  - Affects design partitioning options
  - Optimization problem
- **Latest generation FPGAs are often leading the technology curve**
  - Cost includes paying the vendor / foundry R&D
  - Virtex UltraScale+ uses 16nm FinFET and Stacked Silicon Interconnect

# FPGA Opportunities


- **Design flexibility advantages are obvious**
  - Co-design: FPGA + CPU system
  - Automated decision of where functionality goes
- **Flexibility extends to life cycle**
  - Can be upgraded “in the field”
- **Run time reconfigurability**
  - Speed up program by configuring FPGA for specific tasks
  - Tough choices: synthesis + configuring takes lots of time
  - Xilinx calls this "Partial Reconfiguration"

# Partial Reconfiguration


- **Increases System Flexibility**
  - Perform more functions while keeping some of the system stable (perhaps keeping communication links alive)
- **Allows for Size Reduction**
  - Time-multiplex the hardware so you can use a smaller FPGA
- **Decreases Power Requirements**
  - Shut down power-hungry tasks when not needed



# FPGA Summary


- FPGAs are cool -- flexibility of function
  - For any digital circuit, especially prototyping
  - plus partial reconfiguration, co-design, etc
  - and cheaper than ASICs for volumes below ~0.5M systems
- FPGA Architecture
  - Logic elements / CLB
    - ◆ LUT for combinational logic
    - ◆ FFs for synchronous logic (FSMs, Registers for RTL)
  - Routing
    - ◆ Get signal into / out of logic element
    - ◆ Get signal across the chip to another logic element
  - Input / Output
    - ◆ Get signal to an FPGA pin
  - Other stuff
    - ◆ Hard IP: microcontrollers, memory banks, clock managers, ...
    - ◆ Soft IP: microcontrollers, Firm IP, many vendors

# Where can you go next?


- Classes
  - 18-340, 18-341, 18-447, ...
- Build18
  - Hackathon in January
- 18-500 Capstone
  - Come see what sort of projects you will be building someday soon!
  - Public Demos



# Where can you go next?


- Undergrad Research
  - Not too early to think about doing some research ...
    - ◆ ... and learning what research is like
- Bug any professor on A-level or 2nd floor of Hamerschlag Hall
- Brandon Lucia ([blucia@ece.cmu.edu](mailto:blucia@ece.cmu.edu))
  - Architecture and Platform Design for Energy-harvesting Computer Systems
    - ◆ New computer architecture, simpler programming abstractions, ...
  - Satellites!!

Also ask:

Shawn Blanton  
James Hoe  
Priya Narasimhan  
David O'Hallaron  
Raj Rajkumar  
Bryan Parno  
Anthony Rowe  
Brandon Lucia  
Franz Franchetti  
Ken Mai

Nathan Beckmann (CS) is also a likely candidate

ALSO: Student Project Tracker at [spt.apps.ece.cmu.edu](http://spt.apps.ece.cmu.edu) has lots of projects for pay/credit with ECE Faculty

# Finally...


- **Thanks for an enjoyable semester**
- **Please give us (Bill, Tom, TAs) feedback**
  - FCEs
  - TA Feedback Form
  - Email
  - Discussion
  - ... whatever and whenever
- **And, keep in touch!**