

day/date

1/2/23

## Digital System Design

### Recap

#### Half Adder

»» Truth Table

| A | B | f |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

$$»» f = A'B + AB' \quad (\text{SOP})$$



Design Optimization  
K-Maps, etc.

#### Full Adder

»» List out truth table

»» Write expression

»» Draw circuit

day/date



»» Textual (way of expressing problem)



»» Flow



- Simulation for verification
- Synthesis for netlist

/date

2/2/23

## Digital System Design

### Verilog

T.T  $\Rightarrow$  Expression  $\Rightarrow$  KMaps

Design

Methodology

Netlist

Implementation

```
module example (a,b,c,y)
```

```
    input a;
```

```
    input b;
```

```
    input c;
```

```
    output y;
```

```
endmodule
```

### Behavioral HDL

→ Writing gate level code for large scale designs is tedious

→ "assign" : assign f = x<sub>1</sub> & x<sub>2</sub>

## Digital System Design

### Levels of Abstraction

- Gate Level

- Gate primitives
- Need to have netlist / circuit in mind

- Behaviour



→ assign : combinational circuits

→ Procedural assignment statements : [if... else  
for  
case  
etc.]

→ always @(...)

↑ sensitivity list / event control

Example: Full-Adder

$$\text{sum} = a \otimes b \otimes c$$

$$\text{carry} = a.b + b.c + c.a$$

| c | b | a | carry sum |
|---|---|---|-----------|
| 0 | 0 | 0 | 0         |
| 0 | 0 | 0 | 1         |
| 0 | 1 | 0 | 0         |
| 1 | 0 | 1 | 1         |
| 1 | 1 | 0 | 0         |
| 0 | 1 | 1 | 1         |
| 1 | 1 | 1 | 1         |

## Digital System Design



Adder (Extension by connecting one bit wide adders)



Leaf node

make a single fulladder module and instantiate it 4 times, with corresponding inputs/outputs

- synthesizer looks at the whole structure and then outputs the circuit
- not sequential, unlike the compiler

→ loop should be in the always @() block

LHS variables should be of type reg

- for... loop index should be of type integer; cannot be anything else unless using generate
- generate .... end generate ; index of type genvar

- add bit is the stage name
- generate masks the gates in RTL viewer ; no structural changes, only the way it is presented
- keep type **reg** reserved for always @() blocks



## Levels of Abstraction



## Variables

- wire - intermediate connections
- reg - net output type
- genvar - loop index { generate block }
- integer - loop index { always block }

## Digital System Design

### Testbench

RTL      UPP  
 ↳ simulation ↴

- Does not typically have inputs / outputs.
- DUT must be instantiated in the testbench.

1. Declare module tb(); // No I/O
2. reg for input to DUT, wire for output
3. DUT instantiation
4. Initialization {initial begin ... end }



### Sequential Circuits

Combinational

+ = Sequential

Memory

## Digital System Design

- each always @(\*) block has a separate hardware footprint inferred by the synthesizer
- order of statements inside always @(\*) block matters
  - LHS = reg ; RHS in event control last statement to assignments in an always @(\*) block is executed, provided the reg on LHS is same
- Blocking Assignment (=)
  - Blocks progress ; Next waits for prev.
- Non-blocking Assignment (<=)
  - All assignments will be made at the same time

always @(\*)

Assume all inputs to be zero

begin

$$p = a \wedge b \quad // p = 0, 1$$

$$g = a \delta b \quad // g = 0, 0$$

$$s = p \wedge \text{cin} \quad // s = 0, 1$$

$$\text{cout} = g \mid (p \wedge \text{cin}) \quad // \text{cout} = 0, 0$$

end

- a goes to 1

With non-blocking, originally, the same outputs are yielded.

When a goes to 1, non-blocking statements observe a state delay of one cycle.

- Reserve blocking statements for combinational circuits.

## Digital System Design

### Sequential Circuits



D-latch and Flip Flop

O/P. In synchronous circuits, we always prefer ff's due to their ability to mitigate glitches.



→ always @ ( posedge clk  
negedge ... ) to make a clocked always block.

upon detecting either keyword, synth. will make LHS variables outputs of a flip flop

→ Latches can be inferred using an if statement, however, without else (selection)

Reset (Drive o/p to zero regardless of i/p)

└ Synchronous Reset

(Check reset on clock state)

└ A. Synchronous Reset

(Non - clocked counterpart)

## Shift Register

```
module shift_register (input [0:0] in,  
                      input clk, rst,  
                      output reg [0:0] out, output [3:0] Q);
```

ignore {  
 reg out; integer i;  
 always @ (posedge clk) begin  
 Q[i % 3] <= Q[i % 3 ~]

```
    reg out; reg Q;  
    always @ (posedge clk) begin  
        Q[1] <= in;  
        Q[2] <= Q[1];  
        Q[3] <= Q[2];  
        out <= Q[3];
```

end

endmodule

## Digital System Design



- Combinational circuit : Event control  
can be replaced by \*
- Sequential circuit : Explicitly state posedge / negedge in event control

## Finite State Machines

- next state logic
- state register
- output logic

Example: w : speed indication z : control



|        | Present | Input | Next |
|--------|---------|-------|------|
| ignore | A=Z     | w     | A=Z  |
| revise | 0       | 0     | 0    |
| FSMs   | 0       | 1     | 1    |
|        | 1       | 0     | 0    |
|        | 1       | 1     | 1    |

Moore: if output & present state only

Mealy: if output & present state + input

- initial is reserved for testbench / sim
- exception is for loading bits into memory using `$readmemb("file.txt", rom);`
- use templates for boiler-plate code : `Rom (clocked read template)`
- use IP catalog for blackbox models of complex blocks

/date

8/03/23



## Digital System Design

In always @ (posedge ...)

As many flops are inferred as  
unique number of LHS assignments

A lotta quizzes today maybe

It really do be like that when  
it be when it does. ; - ;

## Digital System Design

### Programmable Logic Devices

86



FPGA: offers reusability over discrete patching

L Design is about trade offs

- ~ More flexible but less performance
- ~ vice versa

L JTAG: Joint Test Access Group

## Digital System Design

### CPLD

Contains several PAL-like blocks (PLA with fixed OR plane)

- Interconnection : Architecture ensuring connection b/w PAL-like blocks
- PAL-like blocks are connected / placed in 2D fashion

### FPGA

- LB, SB and CB in 2D fashion
- Two LB's are connected using CB's
- Two CB's are connected using SB's
- Logic Block is fundamentally a truth table in silicon form  
 ↗ look-up table  
 I butchered the spellings, mb

### FPGA

#### LB

#### Dedicated Resources

#### μ Processor

main () { --- }

--- HOTSPOT : To be accelerated  
   (Exclusive circuit)

---

## Digital System Design



- ISA

↳ captured by HDL

↳ mapped to [ FPGA      ASIC ]



→ Hard logic (dedicated HW) will always (?) beat soft logic performance

# Digital System Design

:-)  
o\_o  
u\_w

Gotoha!

```
if ... else
→ always @(*)
begin
  if s == 0
    a = b
  else
    a = c
end
```

vs.

```
case
→ always @(*)
```

```
begin case(s)
  1'b0: a = b;
  1'b1: a = c;
  default: a = xx;
endcase
end
```

if there are more than two ~ i.e presence of if ... else if ... else : a priority encoder is inferred

| in      | out |
|---------|-----|
| 0 0 0 0 | xx  |
| 0 0 0 1 | 0 0 |
| 0 0 1 x | 0 1 |
| 0 1 x x | 1 0 |
| 1 x x x | 1 1 |

case [ case x ] { Dont use these lit. cases }  
 [ case z ] — guess they bad ~

else absence. { LATCH }



combinatorial feedback path

} incomplete if else or missing default

/date

## Circuit Timing

[ Delay { 90% - 10% , 50% - 50%  
tr. time it takes to reach  
10% → to 90%.

## Digital System Design

### Path Delay & Performance Analysis

- Critical Path
- Metastability
- $F_{max}$
- Circuit Timing

### Operators:

$\{ +, -, \times, / \} \rightarrow$  Arithmetic Circuits

root is of all operators  
addition

Gate Delay occurs due to stray capacitances

made by overlap ←  
of two materials (n-p juncts.)

Longest propagation delay in the  
circuit {attributed to output ?}



$$\text{delay} = n \times \text{fundamental delay}$$

To synchronize the circuit, we insert flip-flops clocked by a clock with period that of the longest path (critical path) {also fixes race conditions ?}

## Critical Path



Longest Path b/w the source and destination flops

$$F_{max} = \frac{1}{\text{critical Path}}$$

Global  $F_{max}$  is the largest  $F_{max}$  of individual blocks in the design

$$CP_{global} \quad \{ \max(CP_1, CP_2, \dots, CP_n) \}$$

$$F_{max}_{global} \quad \{ \min(F_{max_1}, F_{max_2}, \dots, F_{max_n}) \}$$

$T_{min}$  = minimum clock period

↳ short path delay } misinterp.



P1 : SFF  $\rightarrow$  AND 3  $\rightarrow$  D  $\rightarrow$  D  $\rightarrow$  NOT 1  $\rightarrow$  D  $\rightarrow$  DFF

P2 : SFF  $\rightarrow$  AND 1  $\rightarrow$  AND 2  $\rightarrow$  NOT 1  $\rightarrow$  D  $\rightarrow$  DFF

SFF's Posedge : Sampling edge

DFF's Posedge : Capturing edge

## Digital System Design



If  $D$  changes within  $t_{\text{setup}} - t_{\text{hold}}$ , the system becomes metastable.

$t_{\text{csa}}$ : delay - it takes for  $Q$  to appear after the sampling edge

## Digital System Design

$t_{\text{setup}}$  &  $t_{\text{hold}}$ : timing constraints  
 $t_{\text{c}\rightarrow\text{o}}$ : delay

$$\rightarrow t_{\text{clock min}} \geq \underbrace{\text{Critical path}}_{\substack{\text{clock} \\ \text{period}}} + \underbrace{\text{delay of} \\ \text{comb. circuit}}$$



### SFF

- $t_{\text{setup}}$
- $t_{\text{hold}}$
- $t_{\text{c}\rightarrow\text{o}}$

we're only interested in o/p of SFF if it has been sampled already

### DFF

- $t_{\text{setup}}$
- $t_{\text{hold}}$
- $t_{\text{c}\rightarrow\text{o}}$

### Updated condition

$$\rightarrow t_{\text{clock min}} \geq \text{Critical path} + t_{\text{setup(dest)}} + t_{\text{c}\rightarrow\text{o}(\text{source}) \max}$$

$$t_{\text{c}\rightarrow\text{o}} = \begin{cases} \min & \text{as } 0 \rightarrow 1 \\ \max & \text{as } 1 \rightarrow 0 \end{cases}$$

$0 \rightarrow 1$  transitions do not take the same time ~

:( :D :( } my mixed feelings about choosing  
DSP as elective; it do be like that



constraints

$t_{\text{setup}}$  } to check for constraint violation,  
 $t_{\text{hold}}$  focus on DFF's.

if it's the actual max path,  
max path timing constraint } no violations can take place

$$t_{\min} \geq T_{\text{op}} + t_{c \rightarrow d(\text{SFF})} + t_{\text{setup(DFF)}}$$

### Example

$$\left. \begin{array}{l} t_{\min} : 5\text{ns} \\ T_{\text{op}} : 3\text{ns} \\ t_{c \rightarrow d(\text{max})} : 1\text{ns} \end{array} \right\} t_{\text{setup(DFF)}} = 1\text{ns}$$

If combination of  $T_{\text{op}}$  and  $t_{c \rightarrow d}$  is greater than  $t_{\text{setup}}$ ; no setup violation occurs

- Y else wise system is metastable
- E  $t_{\text{setup}}$ : max path constraint
- E  $t_{\text{hold}}$ : min path constraint

» check setup violation on the max path } if setup ≠ violated; hold ≠ violated as well

» faster paths have a higher odd of violating setup and hold constraints.

Example

From slides :-

$$t_{min} \geq T_{op} + t_{su} + t_{cmax}$$

$$t_{min} \geq 2.7 \text{ ns}$$

hold time violation: (no violation condition)

$$t_h \leq \underbrace{T_{op} + t_{cmin}}_{\substack{\text{shortest} \\ \text{possible delay}}}$$

### Fixing Violations

Setup Violation: Slow down CTR ; move registers around ~

Hold Violation: Add BUFFERS

Class Exercise : (Same timing as in previous example )

$$t_{gate} = 1.2 \text{ ns} \quad t_{su} = 0.6 \text{ ns} \quad t_h = 0.4 \text{ ns}$$

$t_{c} : 0.8 \text{ ns} \rightarrow 1 \text{ ns}$

CP = Q<sub>0</sub> → AND<sub>1</sub> → AND<sub>2</sub> → AND<sub>3</sub> → XOR<sub>4</sub> → Q<sub>3</sub>

$$\underline{t_{min} \geq T_{op} + t_{cmax} + t_{su}}$$

$$\underline{t_{min} \geq 6.4 \text{ ns}}$$

$$\underline{F_{max} = 156.25 \text{ MHz}}$$

No violation  
whatever

## Hold Time Violation

$$t_h < T_{op} + t_{margin}$$

$$t_h < 1.2 + 0.8$$

$$t_h < 2 \text{ ns}$$

Since,  $t_h = 0.4 \text{ ns}$ ; no hold time violation



## Digital System Design

### Metastable



### Timing Violations

Common Situations → Async. Input Signals → cause of metastability

→ Slower convergence when moving

near the metastable state.

Higher deviations when moving away from stable states.

Meta stable level may resolve to 0 or 1.

→ If a meta level is output from the source ff, everything's well if value settles correctly before the capture edge.

$$\text{Slack time} = \text{Capture edge} - \text{Settling time}$$

## Register Chain (Synchronization)



- Common technique to -
- Add synchronizers to all async interfaces
- Heuristically adopted approach to use two - three ff's as it increases MTBF exponentially



- Addr:  $[n-1:0]$
- Data:  $[n-1:0]$  → Sync chains work fine for single bit signals
- Write: 1'b
- Read: 1'b
- Added complexity / Hardware inefficient to implement sync chain for every bit
- Use FIFO (first-in - first-out) ~ using



## Digital System Design

### Arithmetic Circuits

- └ Addition
- └ Subtraction
- └ Multiplication
- └ Division

### → Combinational Blocks

- └ Results available in one clock cycle

### → Multi-Cycle

- └ Compute result over multiple clock cycles

### → 1-bit Half Adder



Does not ~  
accommodate for  
Carry  
of previous stage

### → Full Adder

Still 2-operands, however,  
with carry accommodated for



To perform  $A - B$ ; set  $C_{in} = 1$  and negate  $b$ 's

To perform both; use XOR with  $b$ 's and  $C_{in}$  as control

→ Delay for N-bit Carry Propagate Adder

$$2(N-1)t_{pd} + 3t_{pd}$$

proportional to  $N$

Carry - Select Adder



$$A_0 : A_3 - 0 \quad A_1 : A_7 - 4$$

$$B_0 : B_3 - 0 \quad B_1 : B_7 - 4$$

Increased performance at the expense of Area

$$41\% \downarrow t_{pd}$$

$$\text{Area } \uparrow 50\%$$

## Carry Look-Ahead Adder

$$C_{i+1} = g_i + p_i \cdot c_i$$

where  $g_i = a_i \cdot b_i$  Generate  
 $p_i = (a_i + b_i)$  Propagate

- If  $g_i$  is 1 ;  $c_{i+1}$  is predicted to be 1
- If  $g_i \neq 1$ , and either A or B is one,  
 $c_{i+1}$  is the same as previous  $c_i$ .

$g_i$  &  $p_i$  are independent of carry terms  
only the inputs

All outputs ready after  $L_{tpD}$ .  
Delay remains same after even considering  
n-bit adder.

| BitW | CPA | CSA | CLA |
|------|-----|-----|-----|
| 4    | 9   | 10  | 4   |
| 8    | 17  | 10  | 4/5 |
| 16   | 33  |     |     |
| 32   | 65  |     |     |
| 64   | 129 |     |     |

.. 4 if assuming  
no fan-in problem  
5 if arranging  
CLA as CSA.

## Multiplication

→ Multiplication is an AND operation.

$N$  bit  $\times N$  bit  
 $\downarrow 2N$  output bits

$P = A^* B \rightarrow P$  must have twice the bits of A and B

$\rightarrow$  Dedicated Hardware Mapping: {Hard Logic}  
 $\hookrightarrow$  Default  $F_{max} = 1$  GHz

{ DSP Multiplier :  $F_{max} = 250$  MHz }

Combinational implementation's relation  
with  $F_{max}$ ;  $\sim$  in Lecture

### Serial Multiplier (Multi-Cycle)



Registers also need load  $\{ P=0 ; A=\sim B=\sim \}$



Data path : { z : zero detect ;  
b<sub>0</sub> : B at 0<sup>th</sup> index }

Control path : { shiftA, loadA, shiftB  
loadB, loadP, selP }



~ Fixed Point      } Arithmetic (skippable or just)  
~ Floating Point    } general idea

~  $\mu P$  vs. FPGA

parallelism

needs task to be parallelized be same /



Take the longest time of workers ;  
analogous to longest path in FFs

Takes 25 min for each big task

$$\bar{T} = \{ T_1, T_2, T_3, T_4, T_5 \}$$

Continued on next DSD page

## Digital System Design

Continued



Overall pipelining gain is  $\approx$  to the number of stages

Example



Sequential:  $3 \times 3 \text{ hrs} = 9 \text{ hrs}$

Pipeline:  $3 + 2(1) \text{ hrs} = 5 \text{ hrs}$

gain  $\approx$  number of pipeling stages

Pipelining helps throughput the most