



# Reconfigurable Instruction Set Extensions using FABulous eFPGAs - when and how

[dirk.koch@ziti.uni-Heidelberg.de](mailto:dirk.koch@ziti.uni-Heidelberg.de)

# Accelerator Coupling



## Loosely Coupled

- For „large“ acceleration tasks
  - Compress image / video
  - Encrypt
- Called through drivers
- Hardware-centric (mostly stand alone processing)
- Complex to design



## Tightly Coupled

- For „small“ acceleration tasks
  - Parity, count-ones
  - CRC
- Called in user mode
- **Software-centric**  
(fine-grained function calls)
- Easy to design

# Reconfigurable Instruction Set Extensions

- Present GP CPU micro architectures leave not much headroom for optimization
- CPU clock is limited by power
  - trend to feature-rich instruction sets and acceleration

Crypto



- Note that **instructions have a shelf life**

# Possibilities to add Custom Instructions

---

- Instructions implemented **statically** in hardware
- Instructions implemented with **reconfigurable** hardware (e.g., eFPGA)
- **ISA subsetting**: kick out unused instructions (usually for FPGAs only)
- Instructions **emulated in software** (may speed-up your system)

## Why/How?

- Adding instructions can make your CPU run slower!
  - removing instructions may make your CPU faster
  - will benefit all instructions!
  - can offset the cost for software emulation (if instr. triggered seldomly)

# Possibilities to add Custom Instructions

---

- **Static** Instructions
- **Reconfigurable** Instructions
- ISA subsetting
- Software emulation

**When what?**

| <b>Execution pattern</b> | <b>Implementation technique</b> |
|--------------------------|---------------------------------|
| Frequent                 | Static hardware                 |
| Burst                    | Reconfigurable hardware         |
| Infrequent               | Software emulated               |
| Non-occurring            | Removed                         |

# Possibilities to add Custom Instructions



- Sliding window approach
  - Slide a window of size  $t_{reconfiguration}$  over the instruction stream (for each instruction separately)
  - Empty spots in the filtered trace mean reconfiguration is feasible

# Possibilities to add Custom Instructions



- Configuration is relatively slow → we need coarse scheduling granularity
- Perhaps at program level (custom HW Kernels)

# Reconfigurable Instruction Set Extensions

- Let's replace the NEON vector unit with an FPGA fabric of ~identical size (i.e. 2080 LUTs, 16 DSPs, 8 BRAMs)
- Interesting for low precision SIMD arithmetic (128 bits allow 42 3-bit multiplications costing 1764 LUTs)

Dual ARM A9 SoC Floorplan



Zynq chip with ARM SoC



LUT DSP BRAM

# Reconfigurable Instruction Set Extensions

- The logic resources are about one 32-bit softcore CPU
- Vector interface allows more operands and results and...
- ...allows catching up with the faster hardened part



*soft-NEON: A study on replacing  
the NEON engine of an ARM SoC  
with a reconfigurable fabric*  
ASAP 2016

# Tightly Coupled Reconfigurable Instructions

## Candidates:

- `int foo (int OP_A, OP_B);`
- DES, AES, SHA1-3, MD5  
Montgomery, CRC, ...
- Hash functions
- (De)Compression  
(Huffman, bit-level)
- Consider internal registers / register files
- Not bound to 2 x input, 1 x output
  - `int foo1 (int OP_A, OP_B); int foo2 (int OP_C, OP_D);`
  - push / pop
- Breakpoints, watchpoints (complex triggering), event counters
- Replacing defect operations???

Peer et al. "Human Skin Colour Clustering for Face Detection"

$(R, G, B)$  is classified as skin if:  
 $R > 95$  and  $G > 40$  and  $B > 20$  and  
 $\max\{R, G, B\} - \min\{R, G, B\} > 15$  and  
 $|R - G| > 15$  and  $R > G$  and  $R > B$

# Study: Soft-NEON

- An Interlay is reconfigurable!
  - ISA subsetting
  - Vector width customization
  - Operation folding



# Custom Interlay – RISC-V Prototype

- As a test vehicle, we implemented a dual-core RISC-V with a **shared Interlay**



Cortex-A9 SoC Floorplan  
NEON Area: 2080 LUTs, 16  
DSPs, 16 BRAMs[2]



Dual-Core RISC-V with a PR  
shared custom unit floorplan  
PR Shared Area: 2082 LUTs  
Slot: 694 LUTs



Dual-Ibex-Crypto-eFPGA for  
cryptography  
Google Shuttle - MPW4  
<https://github.com/nguyendao-uom/ICESOC>

# Implementation



- Easy method: tap into the operands and multiplex in a result  
→ may require different number clock cycles to evaluate



# Reconfigurable Instruction Set Extensions

The FlexBex (Ibex with eFPGA) approach:

We use the following instruction encoding:

**eFPGA**[*result\_select*]d[*delay*] dest, RS1, RS2 // *delay*: 0...15 cycles

- Register manipulation instruction: dest  $\leftarrow$  RS1 OP RS2
- Inline assembly:

```
int ina, inb, result;  
ina = 10;  
inb = 20;  
asm volatile  
(  
    "eFPGA3d2%[z],%[x],%[y]\n\t"  
    : [z] "=r" (result)  
    : [x] "r" (ina), [y] "r" (inb)  
)
```

# Reconfigurable Instruction Set Extensions

---

**Alternative: configurable instruction encoding:**

| Instruction | Slot                   | Delay slots      |
|-------------|------------------------|------------------|
| Encoding    | [NULL/trap, 1, 2, ...] | [const, dynamic] |
| Encoding    | [NULL/trap, 1, 2, ...] | [const, dynamic] |
| ...         |                        |                  |

**Idea:**

- **After configuring the CI, the OS/driver writes some registers to instruct the CPU how to decode/use that instruction**
  - allows relocating instructions to different slots (defragmentation!)
  - allows slot-dependent latency
  - maybe key to port instructions to different process nodes
- **Problem: we can have an infinite number of CIs (overload Encoding?)**

# ISA Extensions – Discussion

---

## What we haven't discussed:

- **Stateless versus stateful instructions**
  - how do we do context switches? (an OS must be able to discover this)
  - does it need to support reentrant mode? (for recursion or multiple threads)
  - Simple case: **MULACC**:  $acc \leftarrow acc + (RS1 \times RS2)$   
(needs some thought how to recover state after context switch)
- Prevent Deadlocks if instruction isn't available
  - software fallback (usually traps)
- How do we control configuration?
  - explicit per config request?
  - implicit per trap? (e.g., run n-times emulation before configuration)

# ISA Extensions – Discussion

---

## What we haven't discussed:

- Do we want to share instructions among cores?
  - creates resource conflict (needs arbitration)
  - creates extra latency
  - may that cause deadlocks?
  - security (e.g., Spectre-kind of attacks)?
  - much more complex but possibly better resource utilization
- Can tasks move to different cores?
  - needs moving the configuration
- Can customs instructions be ported among systems?
- Can we support iteration intervals of 1?

# ISA Extensions – Discussion

---

## What we haven't discussed:

- How do allocate reconfigurable resources among CIs?
  - we may have multiple Tasks
- Must be implemented efficient (or the benefit of CIs will be offset)
  - instruction cycles & code density
    - e.g.: if a CI saves 20 instructions each extra cycle eats 5% efficiency
    - worse in reality as we are not firing the CI all the time
    - CIs are usually doing more work but also more latency...
- In summery: the potential using reconfigurable CIs is huge but it is non-trivial to feature this in a full-blown OS with multicore, multi tenancy, etc. support.



# Reconfigurable Instruction Set Extensions using FABulous eFPGAs - when and how

# What is FABulous offering?

---

- Fully integrated open-source FPGA framework with **good quality of results** (area & performance)
- Entirely open and free, including commercial use  
(we integrated many other projects: Yosys, ABC, OpenRAM)
- Supports **custom cells** (if provided) → some tooling is on the way
- Supports **partial reconfiguration**
- Designed for **ease of use** while providing **full control as needed**
- **Versatile**
  - Different flows (OpenLane ↔ Cadence) (Yosys/nextpnr ↔ VPR)
  - Easy to customize, including the **integration of own IP**

# The FABulous Framework

- Fully integrated framework for eFPGAs

Uses many projects:

- Yosys & ABC
- nextpnr
- OpenLANE
- VPR
- OpenRAM
- Verilator



# FPGA Basics – Logic

Look-up tables (LUTs) as the basic building block for implementing logic



|   | $A_3, A_2, A_1, A_0$ | LUT-value<br>AND gate | LUT-value OR<br>gate |
|---|----------------------|-----------------------|----------------------|
| 1 | 0000                 | 0                     | 0                    |
| 0 | 0001                 | 0                     | 1                    |
| 1 | 0010                 | 0                     | 1                    |
| 2 | 0011                 | 0                     | 1                    |
| 3 | 0100                 | 0                     | 1                    |
| 4 | 0101                 | 0                     | 1                    |
| 5 | 0110                 | 0                     | 1                    |
| 6 | 0111                 | 0                     | 1                    |
| 7 | 1000                 | 0                     | 1                    |
| 8 | 1001                 | 0                     | 1                    |
| 9 | 1010                 | 0                     | 1                    |
| A | 1011                 | 0                     | 1                    |
| B | 1100                 | 0                     | 1                    |
| C | 1101                 | 0                     | 1                    |
| D | 1110                 | 0                     | 1                    |
| E | 1111                 | 1                     | 1                    |
| F |                      |                       |                      |

# Routing

- FPGAs are made vastly of:
  - (wide) **Multiplexers**
  - Configuration **Latches**
- Customizing these *tactical cells*\* provides most efficiency gain



\*Victor Aken'Ova. 2005. Bridging the Gap Between Soft and Hard eFPGA Design. MSc Thesis. UBC

# LUTs help with the routing (pin swaps are for free)

| A <sub>3</sub> | A <sub>2</sub> | A <sub>1</sub> | A <sub>0</sub> |  |  |  |  |  |
|----------------|----------------|----------------|----------------|-----------------------------------------------------------------------------------|-------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------|
| 0:             | 0              | 0              | 0              | 0                                                                                 | 0                                                                                   | 0                                                                                   | 0                                                                                   | 0                                                                                   |
| 1:             | 0              | 0              | 0              | 1                                                                                 | 0                                                                                   | 0                                                                                   | 1                                                                                   | 0                                                                                   |
| 2:             | 0              | 0              | 1              | 0                                                                                 | 0                                                                                   | 1                                                                                   | 0                                                                                   | 0                                                                                   |
| 3:             | 0              | 0              | 1              | 1                                                                                 | 1                                                                                   | 1                                                                                   | 1                                                                                   | 0                                                                                   |
| 4:             | 0              | 1              | 0              | 0                                                                                 | 1                                                                                   | 0                                                                                   | 0                                                                                   | 0                                                                                   |
| 5:             | 0              | 1              | 0              | 1                                                                                 | 1                                                                                   | 0                                                                                   | 1                                                                                   | 0                                                                                   |
| 6:             | 0              | 1              | 1              | 0                                                                                 | 1                                                                                   | 1                                                                                   | 0                                                                                   | 0                                                                                   |
| 7:             | 0              | 1              | 1              | 1                                                                                 | 1                                                                                   | 1                                                                                   | 1                                                                                   | 0                                                                                   |
| 8:             | 1              | 0              | 0              | 0                                                                                 | 0                                                                                   | 0                                                                                   | 0                                                                                   | 1                                                                                   |
| 9:             | 1              | 0              | 0              | 1                                                                                 | 0                                                                                   | 0                                                                                   | 1                                                                                   | 1                                                                                   |
| A:             | 1              | 0              | 1              | 0                                                                                 | 0                                                                                   | 1                                                                                   | 0                                                                                   | 1                                                                                   |
| B:             | 1              | 0              | 1              | 1                                                                                 | 0                                                                                   | 1                                                                                   | 1                                                                                   | 1                                                                                   |
| C:             | 1              | 1              | 0              | 0                                                                                 | 1                                                                                   | 1                                                                                   | 0                                                                                   | 1                                                                                   |
| D:             | 1              | 1              | 0              | 1                                                                                 | 1                                                                                   | 1                                                                                   | 1                                                                                   | 1                                                                                   |
| E:             | 1              | 1              | 1              | 0                                                                                 | 1                                                                                   | 1                                                                                   | 0                                                                                   | 1                                                                                   |
| F:             | 1              | 1              | 1              | 1                                                                                 | 1                                                                                   | 1                                                                                   | 1                                                                                   | 1                                                                                   |

# FPGA Basics – FPGA Fabric

- Example of an FPGA fabric composed of LUTs, switch matrices and I/O cells. Other common primitives: memories, multipliers, transceivers, ...



# FPGA Basics – FPGA Fabric

---





Fabric Elements HDL Code

```
1 // Copyright 2021 University of Manchester
2 //
3 // Licensed under the Apache License, Version 2.0 (the "License");
4 // you may not use this file except in compliance with the License.
5 // You may obtain a copy of the License at
6 //
7 // http://www.apache.org/licenses/LICENSE-2.0
8 //
9 // Unless required by applicable law or agreed to in writing, software
10 // distributed under the License is distributed on an "AS IS" BASIS,
11 // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 // See the License for the specific language governing permissions and
13 // limitations under the License.
14
15 (*FABulous, BelMap,
16 INIT_0,
17 INIT_1=1,
18 INIT_2=2,
19 INIT_3=3,
20 INIT_4=4,
21 INIT_5=5,
22 INIT_6=6,
23 INIT_7=7,
24 INIT_8=8,
25 INIT_9=9,
26 INIT_10=10,
27 INIT_11=11,
28 INIT_12=12,
29 INIT_13=13,
30 INIT_14=14,
31 INIT_15=15,
32 FF=16,
33 IOmux=17,
34 SET_NORESET=18
35 *)
36 module LUT4c_frame_config_dffesr (I0, I1, I2, I3, O, Ci, Co, SR, EN, UserCLK, ConfigBits)
37 parameter NoConfigBits = 19 // has to be adjusted manually (we don't use an arithmetic
38 // IMPORTANT: this has to be in a dedicated line
39 input I0 // LUT inputs
40 input I1
41 input I2
42 input I3
43 output O // LUT output (combinatorial or FF)
44 input Ci // carry chain input
45 output Co // carry chain output
46 input SR // SHARED_RESET
47 input EN // SHARED_ENABLE
48 (* FABulous, EXTERNAL, SHARED_PORT *) input UserCLK // EXTERNAL // SHARED_PORT // ##
49 // GLOBAL all primitive pins that are connected to the switch matrix have to go before the
50 (* FABulous, GLOBAL *) input [NoConfigBits-1 : 0] ConfigBits
51
52 localparam LUT_SIZE = 4
53 localparam N_LUT_flops = 2 ** LUT_SIZE
54
55 wire [N_LUT_flops-1 : 0] LUT_values
56 wire [LUT_SIZE-1 : 0] LUT_index
57 wire LUT_out
58 reg LUT_flop
59 wire IOmux // normal input '0', or carry input '1'
60 wire c_out_mux, c_IOmux, c_reset_value // extra configuration bits
61
62 assign LUT_values = ConfigBits[15:0]
63 assign c_out_mux = ConfigBits[16]
64 assign c_IOmux = ConfigBits[17]
65 assign c_reset_value = ConfigBits[18]
66
67 //CONFout <= c_IOmux;
68
69 //assign IOmux = c_IOmux ? Ci : IO;
70 my_mux2 my_mux2 IOmux(
71 .AO(IO),
72 .A1(Ci),
73 .S(c_IOmux),
74 .X(IOmux)
```

# Switch matrix

- 1. LUT input muxes
- 2. Constant input value
- 3. LUT and Flop output muxes
  - Rest: local routing
  - Virtex II
    - 332 inputs
    - 160 multiplexer
  - Virtex V
    - 305 inputs
    - 172 Multiplexer



# Rough Cost Estimate

| In total per CLB |         |                        |                           |          | 12,560 | 1,600      |
|------------------|---------|------------------------|---------------------------|----------|--------|------------|
| Virtex-6         |         | Transistors/mux        |                           | In total |        |            |
| Routing resource | # muxes | # inputs               | Pass                      | Conf.    | Trans. | Conf. bits |
| LUT inputs       | 48      | 24                     | 29                        | 50       | 3,792  | 480        |
| FF_in            | 8       | 28                     | 34                        | 55       | 712    | 88         |
| D_in             | 4       | 22.5*                  | 27.5                      | 50       | 310    | 40         |
| CLK and GFAN     | 4       | 14                     | 18                        | 40       | 232    | 32         |
| WE CR SE         | 2 2 2   | 24 23 12               | 29 28 16                  | 50 50 35 | 416    | 54         |
| Local routing    | 96      | 20.5*                  | 25.5                      | 50       | 7,248  | 960        |
| Longlines        | 4       | 12*                    | 16                        | 35       | 204    | 28         |
| Sum routing      |         |                        |                           |          | 12,914 | 1,682      |
| LUT truth table  |         | Trans.: $8 \times 448$ | Conf. bits: $8 \times 64$ |          | 3,584  | 512        |
| In total per CLB |         |                        |                           |          | 16,498 | 2,194      |

\*Average value

# FPGA Configuration

- The easiest way to implement configuration storage is using a shift register
- Bit-wise addressing is way too expensive!
  - frame-based reconfiguration
- But how do we update individual switch matrix multiplexers?



# FPGA Configuration (as used in FABulous)



# Basic concepts



- Basic tiles have same height, but type-specific width (for logic tiles, DSPs, etc.)
- Adjacent tiles can be fused for more complex blocks (see the DSP example) → **Supertile** #

# Basic concepts

LUT column



DSP column

IO column

NULL tile

IO\_pad\_cell



- I/Os belong logically to the fabric but are physically routed to the surrounding
- Internal wires, buses, etc. are „just“ wires at the border of the fabric

# Let's build a small eFPGA: Fabric Definition



- 4 x register file, 2 x DSPs, 4 x LUTs (CLB), I/Os left and right,

# Let's build a small eFPGA: Fabric Definition



- 4 x register file, 2 x DSPs, 8 x LUT-tiles (CLB), I/Os left and right,
- A fabric is modelled as a spreadsheet (tiles are references to tile descriptors)

# Let's build a small eFPGA: Tile Definition



- Wires
- Primitives (basic elements)
- Switch matrix
- Configuration storage

# Let's build a small eFPGA: Tile Definition



- Wires
- Primitives (basic elements)
- Switch matrix
- Configuration storage

# eFPGA Ecosystem – Tile/Wire Definitions

|    |            |                        |          |          |                   |
|----|------------|------------------------|----------|----------|-------------------|
| 10 | TILE       | CLB                    |          |          |                   |
| 11 | #direction | source_name            | X-offset | Y-offset | destination wires |
| 12 | NORTH      | N1BEG                  | 0        | 1        | N1END 3           |
| 13 | NORTH      | N2BEG                  | 0        | 2        | N2END 3           |
| 14 | EAST       | E1BEG                  | 1        | 0        | E1                |
| 15 | EAST       | E2BEG                  | 2        | 0        | E2                |
| 16 | SOUTH      | S1BEG                  | 0        | -1       | S1                |
| 17 | SOUTH      | S2BEG                  | 0        | -2       | S2                |
| 18 | WEST       | W1BEG                  | -1       | 0        | W1                |
| 19 | WEST       | W2BEG                  | -2       | 0        | W2                |
| 20 | JUMP       | J_BEG                  | 0        | 0        | J_                |
| 21 | BEL        | clb_slice_4xLUT4.vhdl  |          |          |                   |
| 22 | MATRIX     | CLB_switch_matrix.VHDL |          |          |                   |
| 23 | EndTILE    |                        |          |          |                   |



- Wires are defined by  
<direction> <symbolic begin|end names> <target offset> <# wires>
- Jump wires for hierarchical routing (Intel/Altera and Xilinx UltraScale style)

# eFPGA Ecosystem – Switch Matrix Definition

```
1 # LUT4AB
2 # double with MID cascade : [N,E,S,W]2BEG --- [N,E,S,W]2MID -> [N,E,S,W]2BEGb --- [N,E,S,W]2END (
3 [N|E|S|W]2BEGb[0|1|2|3|4|5|6|7], [N|E|S|W]2MID[0|1|2|3|4|5|6|7]
4
5 ##### LUT Inputs #####
6 ##### LUT Inputs #####
7 ##### LUT Inputs #####
8
9 # shared double MID jump wires
10 J2MID_ABa_BEG[0|0|0|0],[JN2END3|N2MID6|S2MID6|W2MID6]
11 J2MID_ABa_BEG[1|1|1|1],[E2MID2|JE2END3|S2MID2|W2MID2]
12 J2MID_ABa_BEG[2|2|2|2],[E2MID4|N2MID4|JS2END3|W2MID4]
13 J2MID_ABa_BEG[3|3|3|3],[E2MID0|N2MID0|S2MID0|JW2END3]
14
15
16 # Carry chain Ci -> LA_Ci-LA_Co -> LB_Ci-LB_Co -> ... ->
17 LA_Ci,Ci0
18 L[B|C|D|E|F|G|H]_Ci,L[A|B|C|D|E|F|G]_Co
19 Co0,LH_Co
```

|    | A      | B      | C      | D      | E      | F    |
|----|--------|--------|--------|--------|--------|------|
| 1  | CLB    | N1END0 | N1END1 | N1END2 | N2END0 | N2EN |
| 2  | N1BEG0 | 0      | 1      | 1      | 1      | 1    |
| 3  | N1BEG1 | 1      | 0      | 1      | 0      | 0    |
| 4  | N1BEG2 | 1      | 0      | 1      | 0      | 1    |
| 5  | N2BEG0 | 0      | 1      | 0      | 1      | 0    |
| 6  | N2BEG1 | 1      | 0      | 0      | 0      | 0    |
| 7  | N2BEG2 | 1      | 1      | 1      | 0      | 0    |
| 8  | N4BEG0 | 0      | 1      | 0      | 1      | 1    |
| 9  | N4BEG1 | 1      | 1      | 1      | 1      | 1    |
| 10 | E1BEG0 | 1      | 0      | 1      | 0      | 1    |
| 11 | E1BEG1 | 1      | 1      | 0      | 1      | 1    |

- Describes the adjacency in a symbolic way  
<mux\_output>,<mux\_input>
- Alternatively adjacency matrix

# The FABulous eFPGA Ecosystem

- FABulous eFPGA generator
  - ASIC RTL and constraints generation
  - Generating models for nextpnr/VPR flows
  - FPGA emulation
- Virtex-II, Lattice clones (patent-free!)
- See our FPGA 2021 paper „FABulous: An Embedded FPGA Framework“



# The first open-everything FPGA

---

- Built using open tools  
(Yosys, OpenLane, Verilator...)
- Open PDK  
(Skywater 130 process)
- Google Shuttle (MPW5):  
[https://github.com/nguyendao-uom/open\\_eFPGA](https://github.com/nguyendao-uom/open_eFPGA)



# FABulous Chip Gallery

---



Sky130 with CLBs, DSPs,  
RegFiles, BRAMs  
Google Shuttle - MPW-2  
(can implement RISC-V)

[https://github.com/nguyendao-uom/eFPGA\\_v3\\_caravel](https://github.com/nguyendao-uom/eFPGA_v3_caravel)

# FABulous Chip Gallery

---



Sky130 with CLBs, DSPs,  
RegFiles, BRAMs  
Google Shuttle - MPW-2  
(can implement RISC-V)

[https://github.com/nguyendao-uom/eFPGA\\_v3\\_caravel](https://github.com/nguyendao-uom/eFPGA_v3_caravel)



# FABulous Chip Gallery

---

Dual-Ibex-Crypto-eFPGA  
Google Shuttle - MPW-4  
(custom instructions,  
T-shaped fabric)

<https://github.com/nguyendao-uom/ICESOC>



# FABulous Chip Gallery

## Open ReRAM FPGA test chip

- Sky130, Google Shuttle  
MPW4[https://github.com/nguyendao-uom/rram\\_testchip](https://github.com/nguyendao-uom/rram_testchip)
- Just enough logic to send „Hello World“ to a UART
- Different configuration modes

## Possible advantages of ReRAM FPGAs

- Security (user circuit is encoded in resistive states)
- Reliability (ReRAM is radiation hard)
- Probably density
- Instantaneous on
- CMOS friendly



# FABulous versus OpenFPGA (on Sky130)



- FABulous and OpenFPGA have a Google Shuttle2 submission
- ~ same physical impl. problem
- OpenFPGA CLBs are 17% bigger
- New optimizations gave us further 21.7% in density on the same netlist!

OpenFPGA  
FABulous

resources

|                                       |
|---------------------------------------|
| 1300xMUX2 / 530xDFF                   |
| 376xMUX4 / 46xMUX2 / 8xFF / 586xlatch |
| ~1200 MUX2                            |

# The FABulous eFPGA Framework – Wrap-up

---

- Heterogeneous (FPGA) fabric (DSBs, BRAMs, CPUs, **custom blocks**)
  - Multiple tiles can be combined for integrating more complex blocks
  - Custom blocks can be instantiated directly in Verilog and are integrated in Yosys, VPR/nextpnr CAD tools (Synthesis, Place&Route) (as primitive blocks)
- Support for dynamic **partial reconfiguration**  
(some elements of XC6200, like wildcard configuration)
- Configuration through shift registers or **latches** (or custom cells)
- Support for **custom cell primitives** (passtransistor multiplexers)
- **Good performance / area / power figures** (about 1.5x worse than Xilinx)  
(could be narrowed down through customization)
- **Usable by FPGA users** (you don't have to be an FPGA architect)  
→ there are FPGA classics that we have/will clone
- ToDo: multiple clock domains, **mixed-grained granularity**, ...

# FABulous Contributors

## People:

|                |                                        |
|----------------|----------------------------------------|
| Andrew Attwood | a.j.attwood@ljmu.ac.uk                 |
| Asma Mohsin    | asma.mohsin@stud.uni-heidelberg.de     |
| Bea Healy      | tabitha.healy@student.manchester.ac.uk |
| Dirk Koch      | dirk.koch@manchester.ac.uk             |
| Gennadiy Knies | gennadiy.knis@stud.uni-heidelberg.de   |
| Jakob Ternes   | jakob.ternes@stud.uni-heidelberg.de    |
| Jing Li        | jing.li@manchester.ac.uk               |
| Jonas Künstler | jonas.kuenstler@stud.uni-heidelberg.de |
| Kelvin Chung   | king.chung@student.manchester.ac.uk    |
| Marcel Jung    | marcel.jung@stud.uni-heidelberg.de     |
| Myrtle Shah    | gatecat@ds0.me                         |
| Nguyen Dao     | nguyen.dao@manchester.ac.uk            |



FORTE

Forming the future  
of nanoelectronics

**EPSRC**

Engineering and Physical Sciences  
Research Council



Carl Zeiss  
Stiftung

GEFÖRDERT VOM



Bundesministerium  
für Bildung  
und Forschung

See our projects under: <https://github.com/FPGA-Research-Manchester>

This work is kindly supported by the UK Engineering and Physical Sciences Research Council (EPSRC)  
under grant EP/R024642/1 and Carl-Zeiss-Stiftung

# FPGA Basics – Logic

- Look-up tables (LUTs) are basically multiplexers selecting configuration latches storing a function as a simple truth table
- Configuration latches are usually written through the configuration port only
- In distributed memory options (LUT is used as a shift register or memory file, table is also writeable through the user logic)



# FPGA Configuration

- Do not use shift register configuration

- High power during configuration (thousands of bits)
- Configuration only valid if completely shifted in (transient short-circuits or ring-oscillators)
- Cannot do „real“ partial reconfiguration (static routes through reconfigurable regions)
- Too expensive (shift registers need flip flops, frame-based configuration can do with latches)



# FPGA Basics – Routing (Virtex-II style)



- AMD/Intel use multiple levels of one-hot encoded routing with pass-transistors
- **Multiple activated inputs can cause short-circuit situations**
  - this is why you should blank a region before overwriting it with a new module
  - less of a problem for encoded bitstreams (not one-hot encoded)

# Tile-based Design in FABulous

- Replace standard cell multiplexers with custom mux-4

$$A_{\text{std-cell}} - A_{\text{c-mux4}} \times N = (33.8 \mu\text{m}^2 - 17.5 \mu\text{m}^2) \times 376 = 6,116 \mu\text{m}^2$$

|     | Standard cell |        |        |       | Custom mux-4 |       |
|-----|---------------|--------|--------|-------|--------------|-------|
|     | height        | width  | area   | util. | area         | util. |
| CLB | 219 μm        | 219 μm | 47,961 | 81.8% | 46,225       | 60.7% |
| REG | 219 μm        | 214 μm | 46,866 | 84.1% | 46,655       | 64.3% |
| DSP | 443 μm        | 185 μm | 81,955 | 80.9% | 81,780       | 56.7% |

Observation:

- No area improvement
- Instead: core utilization went down  
→ Congested tile routing

# In short

---



# Optimization: Bitstream Remapping



- The configuration bit cells may induce inferior placement of multiplexers
- We can remap configuration bits → requires remapping of the bitstream (trivial)

# Optimization: Bitstream Remapping



FD=FrameData  
FS=FrameStrobe  
L=Latch

- We use Google's Operations Research tools to compute the grid points (<https://github.com/google/or-tools>)

# Optimization: Bitstream Remapping

