

Github: noah95/formulasheets

## I. Introduction

### I.1 Economic aspects

There are 4 business models:

- IDM: Integrated design manufacturer Chip designer and fab (Intel) (Semiconductor, Luxtera)
- Fab-lite Fabless chip vendor (Qualcomm)
- Chip, Fab + Device (Samsung)
- Fab only (TSMC)

### POVs

#1 records: Fin technology  $> 10^9$  transistor count

#2 market: General purpose vs. application specific

#3 manufacturing: Semi custom: Given gate array, add custom metal = custom circuit

FPL Field programmable logic (FPGA)

LUT/FF that are connected by switches  
Full-custom and standard part

#5 Business models: 4 from ↑ plus IP-vendors and system house: integrates HW/SW into product

### #4 Design engineer

Standard cell: FF / LUT

Mega cell: ADC / DAC

Macro cell: RAM / ROM

## II. Designflow & FPGA

### Y-Chart



### Major stages

#### 1. System-level design

Alternatives, ~~fixed~~ Functionality, interfaces

→ sys.-level model

#### 2. Algorithm design

Allocation, scheduling, binding

→ Bit-true SW model

#### 3. Architecture design

→ Floorplan, RTL code

#### 4. Physical design

→ masks, verification

### FPGA

LUT perform logic operations

FF register element stores result from LUT

Interconnect/Wires convert elements to one another,  
both logic and clock

IO physically available ports get signals in and out  
of FPGA

special BRAM for memory, DSP, PLL

- memory
- Block RAM dedicated memory resource
  - LUT as RAM interconnected
  - FFs in CLBs used as RAM

1. Elaboration Generates netlist of generic logic and arithmetic elements from RTL description
2. Synthesis Maps netlist components to elements available on FPGA and directly connect. Results in netlist and timing/resource estimates
3. Implementation Synthesized netlist mapped onto target device: Place & route
4. Gen. Bitstream Generate sequence of bits to program device

## III. HDL Design & System Verilog

### About

- Case sensitive • logic values 0,1,X,Z • Compact
- Weak type checking • for structural, behavioral, timing and testbenches (all)

### Hierarchy

module name of block, parameters and signal names. Practice: One file per module

```
module top #(
    parameter int width = 16
)
(
    input logic clk_i,
    output logic [width-1:0] res_o
);
```

Body Describes functions, wires and instances

```
logic [width-1:0] test;
ffs #(width(16)) i_res (.clk_i(clk_i), .out_o(res_o));
assign foo = a | b;
```

## Wires

0,1 driven value 'X' drive-conflict, don't care  
'Z' High impedance

Main type: `logic` for I/O, signals, constants

If 'Z' required, use `wire` or `reg` to R/W to

## Arrays:

packed: `logic [7:0] vec` ordered MSB to LSB

unpacked: `int foo [0:15]` can be ordered both ways

$$\text{int } \text{foo}[8] \equiv \text{int } \text{foo}[0:7]$$

address array: `first = vec[0]` `range = vec[2:0]`

concat: `y = {a[2], a[1], a[0], a[0]}` `y = {4{a[0]}}`

Expressing numbers  $N^B_{xx}$  `4'b0011`

N: number of sets

B: base binary, hex, decimal, octal

xx: number '-' are ignored  $\rightarrow$  readability

more types `shortint` 2 state 16bit `int` 2state 32bit

`longint` 2state 64bit `byte` 2 state 8bit/ASCII

`integer` 4state 32bit `time` 4state 64bit unsigned

`int unsigned` `int signed`

Subtypes `typedef logic [7:0] u8_t; u&t_a;`

## Waving

-i in -o out -t type -r asynch. rst

-c clock -n active low `rst_n` combination

## logic

$\sim$  not & and & or & xor  $\checkmark$  ternary operator

Multiplexer: `assign res = sel ? a : b;`  
`assign res = (sel == 2'd1)? d1 : (sel == 2'd0)? ...`

logical: ! not && and & or == equal != not equal

## Sequential

Old Verilog style  
is only executed if signal in sensitivity list changes

```
always @ (posedge clk, rst) begin
    // ...
end
```

## New SV Style

`always_comb` for combinatorial stuff, no sens. list  
use blocking '`=`' operator

`always_latch` latches (bad)  
`always_ff` most used for seq. } use '`l=`'

## Assignments

`=` Blocking, signal assigned before next line is executed  
`l=` unblocking, assigned at end of process

Naming  
-d next state signals, assign in `always_comb`  
-q current state, assign in `always_ff`

## Enum

```
typedef enum logic [1:0]{Init, Run, Stop, Wait} state_t;
state_t state_d, state_q;
// ...
if (state_q == Run) begin
    state_d = Wait
end else if (...) -
```

## case

```
case (state_q)
    Run: state_d = Wait;
    Stop, Wait: begin
        // ...
    end
    default: // ...
endcase
```

More simple:  
`enum logic [a,b,c] state_d;`  
if A 1st state.

## FSM



Moore  
state transition diagram



Mealy

3 parts:  
- next state logic  
- state holding element  
- output logic

## IV. Verification

Functional V. Given set of inputs, what is expected output?

Formal V. All requirements described mathematically proven formally, then

Directed Testing choose small number of input vectors (common, edge cases and random)

`$random(seed);` returns random int  
`$urandom(15,5);` random int within range

Coverage Checks if all lines of code have been reached, how many times a bit toggled, etc...

## Simulation

ATI-Timing specifies clock period, stimuli application time and response acquisition time

$\uparrow$  order operation  $\Delta$  apply inp. T test outp  
tap stim. appl. time  $\rightarrow$  resp. acq. time



tck clock period tpa propagation delay  
tsu setup time tno hold time  
tcd contamination delay relevant for

timax = tck - tsu max input delay  
tonmax = tck - tpd max output delay

## Testbench



# Formal

property fact about design assert directive to prove property

concurrent: assert property  $addr > 0$ ;

assert specify prop. that will hold true

cover check if prop. has occurred and count

assume defines constraint for formal verifier

assert property  $(@posedge clk) a \rightarrow b$ ;

$a \rightarrow b$  if a is true b should hold true

$a \rightarrow b$  if a is true b should hold true in next cycle

$a \rightarrow \#1 b$  2 cycles after a

$a \#1 c \#3 d \rightarrow b$  if first a, c, d then b in next (use consecutive)

$a[*3] \#1 \rightarrow$  repeat x3 with 1 delay

$\#E[4]$  delay stmt. two values, \$ for unbounded

! assertions require high CPU effort!

## SystemVerilog

Timing  
localparam time  $TC = 100ns$ ;  
timeunit  $1us$ ; // default time  
timeprecision  $0.1ns$ ;  
timescale  $1ns/100ps$ ; //old

#10; delay 10 time units

#(PER/2); only executed once

@(rst);  
@(posedge clk); wait for condition  
initial begin  
a = 1'b1;  
#10;  
a = 1'b0;  
end

wait (rst); \$info("reset");  
always begin  
clk=1; #(PER/2); clk=0; #(PER/2);  
end

## File 10

integer fd = \$fopen("file", "r");  
string in;  
integer u = \$ffets(in, fd);  
\$fclose(fd);

while (! \$feof(fd)) begin  
@(posedge clk)  
#STM\_APPL\_TIME;  
\$fsync(fd, "w", stim);  
end

010111100000  
a b c  
Generate

generate  
for (genvar ii=0; ii<num; ii++) begin : gen\_bvs  
my\_bvs #(index(ii)) i\_my\_bvs (.foo(foo), .bar(bar[ii]));  
end  
endgenerate

## II. Algorithm to Architecture

### Solution Space

#### Antipodes

|                 | Algo | Arch            | Exec               | Datapath  | Strengths            |
|-----------------|------|-----------------|--------------------|-----------|----------------------|
| General purpose | any  | instruction set | fetch-load-execute | ALU       | Flexible, low cost   |
| Special purpose | fix  | dedicated       | dataflow           | dedicated | max perf., low power |

#### VLSI Suitability

1. loose coupling b/w major processing tasks
2. simple control flow
3. regular data flow
4. reusable storage
5. finite precision
6. non-recursive LTI
7. no transcendental (trigo, root, log...)
8. extensive operations available in instruction sets
9. no wide div/mult
10. throughput rather than latency

#### Other platforms

1. Dedicated satellites → specialized circuit → SA → MP
2. Host with helper engine → Host → Helper
3. ASIP: Application specific instruction set processor → Host → ALU → ISPU
4. RC: Reconfigurable computing → Host → reconfigurable co-processor
5. DSPP: Domain-specific programmable platform = combination of all the above, E.g. Zynq comp. efficiency → DSPP (RC, ASIP, GPMC) → agility

#### Dedicated Arch.

##### DDG: Data dependency graph

→ transport → latch

$$\text{Eqn: } y = I \dots \text{ DDG: } \begin{array}{c} \textcircled{1} \\ \textcircled{2} \\ \textcircled{3} \\ \textcircled{4} \end{array} \xrightarrow{\textcircled{5}} \begin{array}{c} \textcircled{6} \\ \textcircled{7} \\ \textcircled{8} \end{array}$$

#### Figures of Merit

$\Gamma$  cycles per data item  $t_{lp}$  longest path delay

$T$  time per data item  $\Theta$  throughput

$L$  latency

$A$  circuit size (GE: technology input)

$AT$  area-time prod.

$ET$  energy-time prod.

$$T = \Gamma \cdot T_{cp}$$

$$E = \frac{A}{\Gamma} = \frac{f_{cp}}{\Gamma}$$

## Equivalence Transforms

### Iterative decomposition d-times

$$\textcircled{1} \downarrow \frac{1}{d} \Rightarrow \textcircled{2} \downarrow \frac{1}{d} \leftarrow \text{FSM} \quad A_p \downarrow d + A_s + A_c \leq A_p + A_s + A_c$$

$$\Gamma = d \quad t_{lp} = \frac{t_p}{d} + t_r \quad L = d$$

- + good if repetitive computation
- bad if disparate function

### Pipeline by p

$$\textcircled{1} \downarrow \frac{1}{p} \Rightarrow \textcircled{2} \downarrow \frac{1}{p} \quad A = A_p + p \cdot A_r \quad \Gamma = 1 \quad t_{lp} \approx \frac{t_p}{p} + t_r \quad L = p$$

- + can dock np-times faster! - latency incr.
- + significantly more efficient - cannot always split exactly in half



### Replication by q

$$\textcircled{1} \downarrow \frac{1}{q} \Rightarrow \textcircled{2} \downarrow \frac{1}{q} \quad A = q \cdot (A_p + A_s) + A_{ch} \quad \Gamma = \frac{1}{q} \quad t_{lp} \approx t_p + t_r \quad L = 1$$

### Time Sharing

$$\textcircled{1} \downarrow \frac{s}{fgh} \Rightarrow \textcircled{2} \downarrow \frac{s}{fgh} \quad \max A_i + A_r + A_{ch} \leq A \leq \sum_{fgh} A_i + A_r + A_{ch} \quad \Gamma = s \quad t_{lp} \approx \max t_{pi} + t_r \quad L = s$$

- + good if all streams processed same way
- bad if subfunctions are disparate

### Associativity

$$\textcircled{1} \downarrow \textcircled{2} \downarrow \textcircled{3} \rightarrow \textcircled{4} \downarrow \textcircled{5} \rightarrow O(N) \quad O((cgN))$$

### Retiming

$$-o-c-o-c \rightarrow -o-o-t-o-$$

### Loop Unfolding

$$\rightarrow \begin{array}{c} \textcircled{1} \\ \textcircled{2} \\ \textcircled{3} \end{array} \rightarrow \begin{array}{c} \textcircled{4} \\ \textcircled{5} \\ \textcircled{6} \end{array} \rightarrow \begin{array}{c} \textcircled{7} \\ \textcircled{8} \\ \textcircled{9} \end{array}$$

| Summary  |     | Iteration Decap | Pipeline | Replicate | Timescale | Associativity | Refining | Loop Unfoldig | Better is |
|----------|-----|-----------------|----------|-----------|-----------|---------------|----------|---------------|-----------|
| A        | -/+ | =/+             | +        | -/+       | =         | =             | =        | +             | -         |
| T        | +   | =               | -        | +         | =         | =             | =        | =             | -         |
| $t_{lp}$ | -   | -               | =, una-  | =         | -/+       | -             | -        | -             | -         |
| T        | =   | -               | -        | +         | -/+       | -             | -        | -             | -         |
| AT       | -/+ | -/+             | 2        | =/+       | -/+       | -             | +        | -             | -         |
| L        | +   | +               | =, una+  | +         | =         | =             | +        | -             | -         |
| E        | -/+ | -/+             | =        | =/+       | -/+       | =             | +        | -             | -         |
| ET       | +/- | -               | -/+      | +         | -/+       | -             | -        | -             | -         |

### Temporary Data Stores

Options: On-chip (BRAM, LUT-RAM), Off-chip (SRAM, DRAM)

### Data access patterns

- RAMs: access one item after the other
- Registers: allow simultaneous access

Off-chip Extra parts, delays, routing, energy  
DRAM: memory refresh cycles

### Transform non-recursive computation

Def: result depends on present and past DFG cycle tree

$$\text{Refining: } \text{---} \rightarrow \text{---} \rightarrow \text{---} \Rightarrow \text{---} \rightarrow \text{---} \rightarrow \text{---}$$

Rules:

- Neither outputs nor sources of time-varying inputs may be part of a supervertex that is to be refined

When a supervertex is assigned a lag (lead) by  $\ell$  comp. cycles, the weights of all its incoming edges are in- (de-) incremented by  $\ell$

- No negative edge weights
- Any circular path must always include at least one edge of strictly positive weight



+ tip greatly reduced at zero HW area cost

### Transforming recursive computation

Def: Result depends on earlier outcomes: circles/loops in DFG

$$H(z) = \frac{Y(z)}{X(z)} = \frac{N(z)}{1 - aPz^{-1}} \quad N(z) = \sum a_n z^{-n} \quad \frac{Y}{X} = \frac{E^y}{1 - E^x z^{-1}}$$

- not suitable for higher order loops

Nonlinear FB-loops are in general not amenable to throughput multiplication by applying unfolding.

## VI HLS: High Level Synthesis

Idea: Use C/C++ to design VLSI

- Behavioral Description
- Code needs to base on synthesizable sub-set (static pointers, bounded arrays, no recursive functions, ...)
  - Timing directives required

Flow Basic Building Blocks: Functional units, storage elements, interconnect, control logic

- Steps:
1. Lexical Preprocessing
  2. Hdg optimization
  3. Control state flow graph analysis
  4. Resource allocation
  5. Library processing
  6. Scheduling
  7. Register allocation and binding
  8. Output processing

RTL designers for HLS HLS is not a push-button flow. Need to know HW to be able to program it.

## VII Computer Arithmetic Basics



### RCA: Ripple Carry Adder

is a Carry Propagate adder (CPA)



= Cycles of the carry: MSB output depends on LSB input  
t\_lp = n+1 stages

### Add multiple numbers



CSA principle: An array of CSA reduce the inputs to two, a final CPA merges the two. Performance mostly dictated by CPA.

### Multiplication

1. Partial product generation (simple shift)

2. Carry save tree



### Ops based on add



### Other Ops

Shift: Logical: shift and insert 0  
Arithmetic: shift and if right shift, fill with old MSB

Rotate: rotate circularly

Division: uncommon, mostly implemented iteratively using mult, comp, sub

Exp/Log/Tan: Use LUT

## Generate Propagah

Extend FA by two outputs: Generate =  $A \oplus B$  and Propagah =  $A \wedge B$   $G = A \oplus B$   $P = A \wedge B$   $S = P \cdot G$ ;  $C = G \cdot (C_{i-1}P)$

Then divide an  $N$ -bit adder into multiple  $n$ -bit adder



## CSKA: Carry Skip

$P=0$ : carry is generated in this part  
 $P=1$ : propagate carry from prev. block



## CIA: Carry Increment

$C_{in}$  is assumed 0. When carry is known, worst case: output is incremented



## CSLA: Carry Select

Two parallel adders, one assumes  $C_{in}=0$ , the other  $C_{in}=1$ . Once  $C_{in}$  is valid, the correct calc is selected.



## CLA: Carry lookahead

$$C_0 = C_i \quad C_1 = C_0 + P_0 C_i \quad C_2 = G_1 + P_1 G_0 + P_1 P_0 C_i$$

PPA: Parallel Prefix adders present a systematic approach to designing optimized adders. Stages: pre-processing, carry propagation, post-processing. Only CP is different



$(G_{i,j}^k, P_{i,j}^k)$  are the generate/propagate signals of bits  $i:j$  at propagate stage  $k$ . Initial is  $(G_{i,j}^0, P_{i,j}^0)$ .

Goal is to know all (GP) for all stages  $k$ .

merge two G,P ranges (includes logic)

feedthrough, no logic

Number of stages determines circuit area. Number of stages the critical path and max amount the stage delay.

Examples SK: Sklansky, BK: Brent-Kung, KS: Kashe-Stone

HC: Hvan-Carlsson

|      | A                    | T            | opt |
|------|----------------------|--------------|-----|
| RCA  | 7n                   | 2n           | aaa |
| SK   | $\frac{3}{2} \log n$ | $2 \log n$   | ttt |
| BK   | $10n$                | $4 \log n$   | att |
| KS   | $3 \log n$           | $2 \log n$   | -   |
| HC   |                      |              |     |
| CIA  | $10n$                | $2.8n^{1/2}$ | att |
| CSKA | $8n$                 | $4n^{1/2}$   | aat |
| CSLA | $14n$                | $2.8n^{1/2}$ | -   |

RCA: efficient

BK: compromise

SK: high fanout

KS, HC: fast but high routing

## Miscellaneous

### Mem FSM

//define states

```
enum logic [1:0] {a,b,c,d} state_d, state_q;
```

//comb. for next state

```
always_comb begin
```

//important! defaults

```
state_d = state_q;
```

```
data_d = data_q; //other intervals
```

```
case(state_q)
```

a: begin

```
if (!req_i) state_d = b;
```

```
end
```

b: begin

```
data_d = data_i;
```

```
if (!req_i) state_d = a;
```

```
end
```

endcase

end

//state register

```
always_ff @ (posedge clk_i, negedge rst_i) begin
```

if (!rst\_i) begin

```
state_q = a;
```

end else begin

```
state_q = state_d;
```

```
end
```

end

//out assign

```
assign ack_o = (state_q == b)? 1'S1 : 1'>0;
```